Posted to user@spark.apache.org by durin <ma...@simon-schaefer.net> on 2014/07/11 10:53:19 UTC

KMeans for large training data

Hi,

I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic
clustering with Strings.

My code works great when I use a five-figure number of training elements.
However, with 2 million elements, for example, it gets extremely slow. A
single stage may take up to 30 minutes.

From the Web UI, I can see that it does these three things repeatedly:


All of these tasks only use one executor, and on that executor only one
core. And I can see a scheduler delay of about 25 seconds.

I tried to use broadcast variables to speed this up, but maybe I'm using them
wrong. The relevant code (where it gets slow) is this:




What could I do to use more executors, and generally speed this up? 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: KMeans for large training data

Posted by durin <ma...@simon-schaefer.net>.
Thanks, setting the number of partitions to the number of executors helped a
lot and training with 20k entries got a lot faster.

However, when I tried training with 1M entries, after about 45 minutes of
calculation I got this:



It's stuck at this point. The CPU load for the master is at 100% (so 1 of 8
cores is used), but the WebUI shows no active task, and after 30 more
minutes of no visible change I had to leave for an appointment.
I've never seen an error referring to this library before. Could that be due
to the new partitioning?

Edit: Just before sending, I realized in a new test that this error also
appears when the amount of test data is very low (here 500 items). This time
it includes a Java stack trace, though, instead of just stopping:



So, to sum it up, KMeans.train works somewhere between 10k and 200k items,
but not outside this range. Can you think of an explanation for this
behavior?


Best regards,
Simon



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407p9508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: KMeans for large training data

Posted by Sean Owen <so...@cloudera.com>.
On Fri, Jul 11, 2014 at 7:32 PM, durin <ma...@simon-schaefer.net> wrote:
> How would you get more partitions?

You can specify this as the second arg to methods that read your data
originally, like:
sc.textFile("...", 20)

> I ran broadcastVector.value.repartition(5), but
> broadcastVector.value.partitions.size is still 1 and no change to the
> behavior is visible.

RDDs are immutable, so for this to have any effect you have to do something like:
val repartitioned = broadcastVector.value.repartition(5)
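and then use repartitioned (rather than the original RDD) from that point on.
As a rough sketch (the variable names here are illustrative, not taken from
your code):

import org.apache.spark.mllib.clustering.KMeans

val repartitioned = vectorRDD.repartition(5)    // returns a new RDD; the original is unchanged
println(repartitioned.partitions.size)          // should now report 5
val model = KMeans.train(repartitioned, 10, 3)  // (data, k, maxIterations)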


> First of all, there is a gap of almost two minutes between the third to last
> and second to last line, where no activity is shown in the WebUI. Is that
> the GC at work? If yes, how would I improve this?

You mean there are a few minutes where no job is running? I assume
that's time when the driver is busy doing something. Is it thrashing?


> Also, "Local KMeans++ reached the max number of iterations: 30" surprises
> me. I have ran training using
>
> is it possible that somehow, there are still 30 iterations executed, despite
> of the 3 I set?

Are you sure you set 3 iterations?
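For reference, the maximum number of iterations is the third argument to
KMeans.train, so a call shaped like this (names here are placeholders) caps
it at 3:

val model = KMeans.train(vectors, 10, 3)   // (data, k, maxIterations)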

Re: KMeans for large training data

Posted by durin <ma...@simon-schaefer.net>.
Hi Sean, thanks for your reply.

How would you get more partitions?
I ran broadcastVector.value.repartition(5), but
broadcastVector.value.partitions.size is still 1 and no change to the
behavior is visible.

Also, I noticed this:


First of all, there is a gap of almost two minutes between the third to last
and second to last line, where no activity is shown in the WebUI. Is that
the GC at work? If yes, how would I improve this?

Also, "Local KMeans++ reached the max number of iterations: 30" surprises
me. I have ran training using 

Is it possible that somehow 30 iterations are still executed, despite the 3
I set?


Best regards,
Simon



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407p9431.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: KMeans for large training data

Posted by Sean Owen <so...@cloudera.com>.
How many partitions do you use for your data? If the default is 1, you
probably need to manually ask for more partitions.
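For example, something like this (a sketch; the path and numbers are
placeholders) reads the data into at least 20 partitions and lets you check
how many you actually got:

val raw = sc.textFile("hdfs://path/to/input", 20)   // second arg = minimum number of partitions
println(raw.partitions.size)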

Also, I'd check that your executors aren't thrashing close to the GC
limit. This can make things start to get very slow.

On Fri, Jul 11, 2014 at 9:53 AM, durin <ma...@simon-schaefer.net> wrote:
> Hi,
>
> I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic
> clustering with Strings.
>
> My code works great when I use a five-figure amount of training elements.
> However, with for example 2 million elements, it gets extremely slow. A
> single stage may take up to 30 minutes.
>
> From the Web UI, I can see that it does these three things repeatedly:
>
>
> All of these tasks only use one executor, and on that executor only one
> core. And I can see a scheduler delay of about 25 seconds.
>
> I tried to use broadcast variables to speed this up, but maybe I'm using it
> wrong. The relevant code (where it gets slow) is this:
>
>
>
>
> What could I do to use more executors, and generally speed this up?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: KMeans for large training data

Posted by Aaron Davidson <il...@gmail.com>.
The "netlib.BLAS: Failed to load implementation" warning only means that
the BLAS implementation may be slower than using a native one. The reason
why it only shows up at the end is that the library is only used for the
finalization step of the KMeans algorithm, so your job should've been
wrapping up at this point. I am not familiar with the algorithm beyond
that, so I'm not sure if for some reason we're trying to collect too much
data back to the driver here.

By the way, the SPARK_DRIVER_MEMORY environment variable can increase the
driver memory (or you can use the --driver-memory option with spark-submit).
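For example (the class and jar names are placeholders):

SPARK_DRIVER_MEMORY=4g bin/spark-shell
bin/spark-submit --driver-memory 4g --class your.app.Main your-app.jar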


On Sat, Jul 12, 2014 at 2:38 AM, durin <ma...@simon-schaefer.net> wrote:

> Your latest response doesn't show up here yet, I only got the mail. I'll
> still answer here in the hope that it appears later:
>
> Which memory setting do you mean? I can go up with spark.executor.memory a
> bit, it's currently set to 12G. But thats already way more than the whole
> SchemaRDD of Vectors that I currently use for training, which shouldn't be
> more than a few hundred M.
> I suppose you rather mean something comparable to SHARK_MASTER_MEM in
> Shark.
> I can't find the equivalent for Spark in the documentations, though.
>
> And if it helps, I can summarize the whole code currently that I currently
> use. It's nothing really fancy at the moment, I'm just trying to classify
> Strings that each contain a few words (words are handled each as atomic
> items).
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407p9509.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: KMeans for large training data

Posted by durin <ma...@simon-schaefer.net>.
Your latest response doesn't show up here yet; I only got the mail. I'll
still answer here in the hope that it appears later:

Which memory setting do you mean? I can go up with spark.executor.memory a
bit, it's currently set to 12G. But that's already way more than the whole
SchemaRDD of Vectors that I currently use for training, which shouldn't be
more than a few hundred MB.
I suppose you rather mean something comparable to SHARK_MASTER_MEM in Shark.
I can't find the equivalent for Spark in the documentation, though.

And if it helps, I can summarize the whole code that I currently use. It's
nothing really fancy at the moment; I'm just trying to cluster Strings that
each contain a few words (each word is handled as an atomic item).
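Roughly, it boils down to something of this shape (heavily simplified, and
names like lines are placeholders rather than my actual code):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val numFeatures = 1000

// hash each word of a string into a fixed-length term-frequency vector
val vectors = lines.map { line =>
  val counts = new Array[Double](numFeatures)
  for (word <- line.split("\\s+")) {
    counts((word.hashCode % numFeatures + numFeatures) % numFeatures) += 1.0
  }
  Vectors.dense(counts)
}.cache()

val model = KMeans.train(vectors, 10, 3)   // k = 10 clusters, 3 iterations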



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407p9509.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.