Posted to dev@spark.apache.org by Manolis Gemeliaris <ge...@gmail.com> on 2022/05/05 17:17:31 UTC

An online kmeans algorithm for Spark

Hello everyone on the Dev team of Apache Spark.

My name is Manolis Gemeliaris and I am a student at the Hellenic
Mediterranean University (formerly TEI of Crete). For my thesis project I
would like to add an online k-means algorithm to Apache Spark, based on the
paper <https://arxiv.org/abs/1412.5721> by Edo Liberty et al. and the
authors' Python implementation
<https://github.com/sviri/kmeans/tree/main/onlineKmeans/src>.
From what I have read, getting something like this officially accepted is a
lengthy process. I would therefore like to release it as an open-source
third-party package instead, compatible with Apache Spark 3.
I have already read the contribution guidelines for Spark and spent some
time studying the code on GitHub.

I would like to ask if anyone can find the time to help me get started. Of
course I realize that your time is valuable, so any tips that you can share
would be greatly appreciated.

Thank you in advance,
Best Regards,
Manolis Gemeliaris