Posted to user@spark.apache.org by Aliaksei Litouka <al...@gmail.com> on 2014/06/12 23:31:15 UTC

An attempt to implement the DBSCAN algorithm on top of Spark

Hi.
I'm not sure whether messages like this are appropriate on this list; I
just want to share an application I am working on. It is a personal
project that I started to learn more about Spark and Scala and, if it
succeeds, to contribute it to the Spark community.

Maybe someone will find it useful. Or maybe someone will want to join
development.

The application is available at https://github.com/alitouka/spark_dbscan

Any questions, comments, suggestions, and criticism are welcome :)

Best regards,
Aliaksei Litouka

Re: An attempt to implement the DBSCAN algorithm on top of Spark

Posted by Aliaksei Litouka <al...@gmail.com>.
Vipul,
Thanks for your feedback. As far as I understand, you mean RDD[(Double,
Double)] (note the parentheses), where each of these Double values holds
one coordinate of a point. That limits us to 2-dimensional space, which is
not suitable for many tasks; I want the algorithm to be able to work in
multidimensional space. There is already a class
org.alitouka.spark.dbscan.spatial.Point in my code which represents a
point with an arbitrary number of coordinates.
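For example, creating a point in 3-dimensional space could look like this
(a sketch - I'm simplifying the constructor here, so treat the exact
signature as illustrative):

    import org.alitouka.spark.dbscan.spatial.Point

    // A point with three coordinates; nothing in the class limits
    // the number of dimensions
    val p = new Point(Array(1.0, 2.5, -3.0))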

IOHelper.readDataset is just a convenience method which reads a CSV file
and returns an RDD of Points (more precisely, a value of type RawDataset,
which is an alias for RDD[Point]). If your data is stored in a format
other than CSV, you will have to write your own code to convert it to a
RawDataset.
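For example, if your data is already in an RDD of coordinate arrays, the
conversion can be a simple map (a sketch - the import of the RawDataset
alias and the Point constructor are simplified here):

    import org.apache.spark.rdd.RDD
    import org.alitouka.spark.dbscan.RawDataset
    import org.alitouka.spark.dbscan.spatial.Point

    // Turn each record of raw coordinates into a Point, producing
    // a RawDataset (which is just an alias for RDD[Point])
    def toRawDataset(records: RDD[Array[Double]]): RawDataset =
      records.map(coords => new Point(coords))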

I can add support for other data formats in future versions.

As for other distance measures - that is a high-priority item on my list ;)



On Thu, Jun 12, 2014 at 6:02 PM, Vipul Pandey <vi...@gmail.com> wrote:

> Great! I was going to implement one of my own - but I may not need to do
> that any more :)
> I haven't had a chance to look deeply into your code, but I would recommend
> accepting an RDD[Double,Double] as well, instead of just a file.
>
> val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
>
> And other distance measures, of course.
>
> Thanks,
> Vipul
>
>
>
>
> On Jun 12, 2014, at 2:31 PM, Aliaksei Litouka <al...@gmail.com>
> wrote:
>
> Hi.
> I'm not sure whether messages like this are appropriate on this list; I
> just want to share an application I am working on. It is a personal
> project that I started to learn more about Spark and Scala and, if it
> succeeds, to contribute it to the Spark community.
>
> Maybe someone will find it useful. Or maybe someone will want to join
> development.
>
> The application is available at https://github.com/alitouka/spark_dbscan
>
> Any questions, comments, suggestions, and criticism are welcome :)
>
> Best regards,
> Aliaksei Litouka
>
>
>

Re: An attempt to implement the DBSCAN algorithm on top of Spark

Posted by Vipul Pandey <vi...@gmail.com>.
Great! I was going to implement one of my own - but I may not need to do that any more :)
I haven't had a chance to look deeply into your code, but I would recommend accepting an RDD[Double,Double] as well, instead of just a file.
val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
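Something like this, for instance (just a sketch of the signature I have in mind - the overload and the use of your Point class are made up, since I haven't looked at the internals yet):

    import org.apache.spark.rdd.RDD
    import org.alitouka.spark.dbscan.spatial.Point

    // Hypothetical overload: build the dataset from an RDD of (x, y)
    // pairs that is already in memory, instead of reading a CSV file
    def readDataset(data: RDD[(Double, Double)]): RDD[Point] =
      data.map { case (x, y) => new Point(Array(x, y)) }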
And other distance measures, of course.

Thanks,
Vipul




On Jun 12, 2014, at 2:31 PM, Aliaksei Litouka <al...@gmail.com> wrote:

> Hi.
> I'm not sure whether messages like this are appropriate on this list; I just want to share an application I am working on. It is a personal project that I started to learn more about Spark and Scala and, if it succeeds, to contribute it to the Spark community.
> 
> Maybe someone will find it useful. Or maybe someone will want to join development.
> 
> The application is available at https://github.com/alitouka/spark_dbscan
> 
> Any questions, comments, suggestions, and criticism are welcome :)
> 
> Best regards,
> Aliaksei Litouka