Posted to dev@spark.apache.org by Zakaria Hili <za...@gmail.com> on 2016/11/08 15:33:57 UTC

Issue + Resolution: Kmeans Spark Performances (ML package)

Hello,

I'm a newbie in Spark, but I think I found a small problem that can
affect Spark KMeans performance.
Before explaining the problem itself, I want to explain the warning
that I ran into.

I tried to use Spark KMeans with DataFrames to cluster my data:

from pyspark.ml.clustering import KMeans

df_Part = assembler.transform(df_Part)
df_Part.cache()
while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1

but when I run the code, I receive this warning:
WARN KMeans: The input data is not directly cached, which may hurt
performance if its parent RDDs are also uncached.

I searched the Spark source code to find where this warning comes from,
and realized there are two classes involved:

(mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )

(mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )



Even when my DataFrame is cached, the fit method transforms it into an
internal RDD, which is not cached:

DataFrame -> RDD -> run KMeans training algorithm(RDD)


-> The first class (ml package) is responsible for converting the DataFrame
into an RDD and then calling the KMeans algorithm.

-> The second class (mllib package) implements the KMeans algorithm; here
Spark checks whether the RDD is cached and, if not, generates the warning.


So, the solution to this (small) problem is to cache the RDD before running
the KMeans algorithm:

https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

All we need to do is add two lines: cache the RDD just after the DataFrame
transformation, then unpersist it after the training algorithm runs.


[inline image: screenshot of the proposed two-line patch, not preserved in this archive]
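
Since the screenshot didn't survive the plain text archive, here is a minimal
sketch of what the two-line change could look like inside the fit method of
ml/clustering/KMeans.scala (the names `rdd` and `algo` are assumptions based
on the 2.0-era source, and MEMORY_AND_DISK is just one possible storage level):

    import org.apache.spark.storage.StorageLevel

    // after the DataFrame -> RDD conversion inside fit():
    rdd.persist(StorageLevel.MEMORY_AND_DISK)  // proposed line 1: cache the internal RDD

    val parentModel = algo.run(rdd)            // existing mllib training call (name assumed)

    rdd.unpersist()                            // proposed line 2: release it after training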


I hope that was clear.

If you think I'm wrong, please let me know.


Sincerely,

Zakaria HILI

Re: Issue + Resolution: Kmeans Spark Performances (ML package)

Posted by Joseph Bradley <jo...@databricks.com>.
Hi Zakaria,

Caching the DataFrame can "solve the problem" in terms of efficiency in
many cases.  It is true it will not eliminate the warning message.  I'd be
fine adding the RDD caching, if you're interested in creating a JIRA and
sending a PR.

Joseph


Re: Issue + Resolution: Kmeans Spark Performances (ML package)

Posted by Zakaria Hili <za...@gmail.com>.
Hi Joseph,
Thank you for your reply, but caching the DataFrame (input) doesn't resolve
the problem.
As you can see in my piece of code, I already cached my DataFrame
(df_Part.cache()), but internally ml.clustering.KMeans converts the DataFrame
to an RDD[mllib.linalg.Vector] and then executes mllib.clustering.KMeans.
So even if you cache the DataFrame, the RDD used internally is not cached.
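
For illustration, the internal conversion looks roughly like this (a sketch
of the 2.0-era fit method in ml/clustering/KMeans.scala; exact names may differ):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    import org.apache.spark.sql.Row

    // fit() rebuilds a fresh RDD from the input DataFrame:
    val rdd = dataset.select(col($(featuresCol))).rdd.map {
      case Row(point: Vector) => OldVectors.fromML(point)
    }
    // `rdd` is a brand new RDD[mllib.linalg.Vector]: it is never cached, even
    // when `dataset` is, so mllib.clustering.KMeans sees uncached input and
    // logs the warning.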

Regards
Zakaria



-- 
Zakaria HILI

Big Data Engineer

Practice CSD | Capgemini | Toulouse | Office EJ4-11

Tel.: +33 (0)7 53 65 36 85

www.capgemini.com

109, avenue du Général Eisenhower - 31036 Toulouse Cedex 1 – France

- Emails:

- ZakariaHILI@Gmail.com

- Zakaria.Hili@Capgemini.com

- Zakaria.hili@Enseirb-matmeca.fr

Re: Issue + Resolution: Kmeans Spark Performances (ML package)

Posted by Joseph Bradley <jo...@databricks.com>.
Hi Zakaria,

Thanks for reporting this.  This actually isn't too big of an issue since
the user can cache the Dataset passed to spark.ml.clustering.KMeans before
fitting, but the message does mislead the user.  A patch to fix that
message would be good, though this issue should be fixed before too long as
we move more implementations from spark.mllib to spark.ml.

Thanks!
Joseph
