You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Aditya <ad...@gmail.com> on 2017/06/22 23:43:28 UTC

Density based Clustering in Mahout

Hello everyone,

I've been working for the past few weeks on coding an in-core DBSCAN
algorithm.

A more efficient version with an O(n*log(n)) complexity does exist but it
uses the R-Tree data structure to index the data. I have a few
concerns/questions and I'm hoping you would be able to help me out.

1. Based on my knowledge, using an indexing data structure like an R-Tree
or a Kd-Tree is the only way to reduce the complexity of the dbscan
algorithm. If there's any other method that you are familiar with, please
let me know.

2. From what I've read in the book Apache Mahout: Beyond MapReduce written
by Andrew and Dmitry, I don't see how I can directly exploit the existing
data structures and operations to get the functionality of an R-Tree.

3. On the off chance that an R-Tree module has to built in Mahout, because
it is not possible to easily use existing operations I need some insights
as to how to go about it. I learned that everything in Mahout at the end
should be serializable to a vector. I can't fathom how to do that
intuitively in the case of an R-Tree

There are a couple of other concerns that need to be discussed but these
are vital at the moment.

PS: The research paper that proposed the DBSCAN algorithm can be found here
<https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.

Thanks!

-Aditya

Re: Density based Clustering in Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

PS Maybe we should say, if you can provide kryo serialization, it can be
assumed platform agnostic, and provide api for embedding that further. In
practice all backends (except, I guess, H20 which is going extinct if not
yet) currently support kryo, and the new potential ones could easily add it
too (after all it is just a bunch of bytes after serialization, can't get
any more basic than that).

On Thu, Jul 6, 2017 at 11:21 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
> On Thu, Jul 6, 2017 at 9:45 AM, Trevor Grant <tr...@gmail.com>
> wrote:
>
>> To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
>> structure, that will exist only within the algorithm for doing in-core
>> operations, (or maybe it lives slightly outside of the algorithm so we
>> don't need to recreate trees for DBSCAN, Random Forrests, other tree-based
>> algorithms- e.g. we can reuse the same trees for various algorithms.)  BUT
>> Trees only exist WITHIN the in-core, i.e. we don't want to modify the
>> allReduceBlock to accept Matrices OR Trees, that will get out of hand
>> fast.  Please anyone chime in to correct me/argue against.
>>
>
> +1. that's exactly what i meant.
>
>
>> So really, we've stumbled into a more important philosophical question-
>> and
>> that is: Is it acceptable to create objects which make the internals of
>> algorithms easier to read and work with, so long as they may be serialized
>> to incore matrices/vectors? I am +1, and if it is decided this is not
>> acceptable, I need to go back and alter (or drop) things like the CanopyFn
>> [2] of the Canopy Clustering Algorithm.
>>
>
> +1 too if it is practical.
> The dilemma here is that if one wants to stay platform agnostic then the
> algorithm has to use platform-agnostic persistence/serialization, of which
> samsara provides only that of DRM/Matrix/Vector. So yes, if it is naturally
> mapping to record-tagged numerical information, it is preferable (and
> that's what i actually did a lot encoding models).
>
> In practice however of course in a particular application settings it is
> often such that people can't car less about backend compatibility, in which
> case a custom serialization is totally ok. But it in public mahout version
> it would run against the party line of staying backend agnostic so if at
> all possible with a little overhead, we try to avoid it.
>

Re: Density based Clustering in Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Thu, Jul 6, 2017 at 9:45 AM, Trevor Grant <tr...@gmail.com>
wrote:

> To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
> structure, that will exist only within the algorithm for doing in-core
> operations, (or maybe it lives slightly outside of the algorithm so we
> don't need to recreate trees for DBSCAN, Random Forrests, other tree-based
> algorithms- e.g. we can reuse the same trees for various algorithms.)  BUT
> Trees only exist WITHIN the in-core, i.e. we don't want to modify the
> allReduceBlock to accept Matrices OR Trees, that will get out of hand
> fast.  Please anyone chime in to correct me/argue against.
>

+1. that's exactly what i meant.


> So really, we've stumbled into a more important philosophical question- and
> that is: Is it acceptable to create objects which make the internals of
> algorithms easier to read and work with, so long as they may be serialized
> to incore matrices/vectors? I am +1, and if it is decided this is not
> acceptable, I need to go back and alter (or drop) things like the CanopyFn
> [2] of the Canopy Clustering Algorithm.
>

+1 too if it is practical.
The dilemma here is that if one wants to stay platform agnostic then the
algorithm has to use platform-agnostic persistence/serialization, of which
samsara provides only that of DRM/Matrix/Vector. So yes, if it is naturally
mapping to record-tagged numerical information, it is preferable (and
that's what i actually did a lot encoding models).

In practice however of course in a particular application settings it is
often such that people can't car less about backend compatibility, in which
case a custom serialization is totally ok. But it in public mahout version
it would run against the party line of staying backend agnostic so if at
all possible with a little overhead, we try to avoid it.

Re: Density based Clustering in Mahout

Posted by Trevor Grant <tr...@gmail.com>.

To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
structure, that will exist only within the algorithm for doing in-core
operations, (or maybe it lives slightly outside of the algorithm so we
don't need to recreate trees for DBSCAN, Random Forrests, other tree-based
algorithms- e.g. we can reuse the same trees for various algorithms.)  BUT
Trees only exist WITHIN the in-core, i.e. we don't want to modify the
allReduceBlock to accept Matrices OR Trees, that will get out of hand
fast.  Please anyone chime in to correct me/argue against.

Aditya and I have talked offline, but the paper in question is [1]

The data-scructure that needs to be serialized is referred to as the
disjoint-set data structure.  Trees are used locally, but the disjoint-set
is the structure that is used during the reduce phase to merge.  This is
the dataset that will have to be serialized into a vector or matrix to
remain complaint with current implementation of allReduceBlock.

As far as I'm concerned- it would also be permisable to create a
disjoint-set data structure, and implement the FIND and UNION methods per
the referenced paper. The only requirement for Mahout compatability is to
be able to serialize/deserialize this structure into a vector/matrix (I'm
sure we can do this, and I will help Aditya once he makes the basic
structure)

So really, we've stumbled into a more important philosophical question- and
that is: Is it acceptable to create objects which make the internals of
algorithms easier to read and work with, so long as they may be serialized
to incore matrices/vectors? I am +1, and if it is decided this is not
acceptable, I need to go back and alter (or drop) things like the CanopyFn
[2] of the Canopy Clustering Algorithm.

[1]
http://www.ece.northwestern.edu/~choudhar/Publications/ANewScalableParallelDBSCANAlgorithmUsingDisjointSetDataStructure.pdf
[2]
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/Canopy.scala#L129

On Wed, Jul 5, 2017 at 11:07 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> PS i read a few papers, including i believe that of Google's, on
> partitioning of the DBScan problem for parallelization. It did not fit my
> purposes though as they inherently assumed that every cluster problems had
> enough centroids to figure to be efficiently partitioned. In effect it
> amounted to similar effects as the probabilistic sketches techniques
> mentioned in the book, but with much more headache for the bang. I
> eventually turned to solving problems pre-sketched one way or another
> (including for density clustering problems).
>
> On Wed, Jul 5, 2017 at 8:59 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > (1) I abandoned any attempts at DBScan and implemented another density
> > algorithm itself (can't say which, subject to patent restrictions). The
> > reason being, i couldn't immediately figure how to parallelize it
> > efficiently (aside from data structure discussions), the base algorithm
> is
> > inherently iterative.
> >
> > (2) Samsara provides R-base level algebra, not general data structure
> > support. IMO it would not pay to adjust it to do that any more than
> trying
> > to fit R base to do it. I did implement spatial structures standardized
> on
> > using Samsara in terms of carrying out computations (mostly in-memory),
> but
> > those were still data structures in their on right.
> >
> > (3) like i said before, experience tells me that using collection of 2d
> > tensors (esp. sparse tensors) is fine instead of trying to introduce n-d
> > tensor. The fundamental difference here is mostly with sparse operations
> > where n-d sparse tensor could intelligently execute those. But this is
> not
> > supported properly pretty much by any library i know, so all the
> difference
> > in most cases that say they support it directly is just putting a tensor
> > api over collection of tensors. Practicality of having dense n-d tensors
> is
> > also a bit questioned, since it immediately increases requirements for
> > processing memory quantity of a single tensor instance, whereas
> collection
> > of tensors could be represented as retaining streaming properties. etc.
> > etc. Overall it sounds to me like a solution in a search of a problem
> > (given my admittedly very limited experience as a practitioner in math).
> >
> >
> > On Wed, Jul 5, 2017 at 12:09 AM, Aditya <ad...@gmail.com>
> wrote:
> >
> >> ***Important** **Do read** *
> >> Hello everyone,
> >>
> >> Trevor and I have been discussing as to how to effectively represent an
> >> R-Tree in Mahout. Turns out there is a method to represent a Binary
> Search
> >> Tree (BST) in the form of an ancestor matrix. This
> >> <http://www.geeksforgeeks.org/construct-ancestor-matrix-from
> >> -a-given-binary-tree/>
> >> and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-m
> >> atrix/>
> >> show the conversion logic from a tree representation to a matrix
> >> representation and vice versa.
> >>
> >> In a distributed scenario, I know of the following design
> >> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJc
> >> ci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>
> >> which is fairly efficient and intuitive to understand.
> >>
> >> Now the point for debate is this:
> >> **Please read the design provided in the link above before proceeding**
> >> The R-Tree will always be a local entity, no where in the algorithm is
> >> there a need / necessity to have a distributed R-Tree kind of scenario.
> On
> >> the other hand, the data points as well as the union find data structure
> >> need to be stored in a DRM like data structure and they very well can be
> >> represented in the form of a matrix. (A union find data structure
> >> basically
> >> can be implemented using a vector)
> >>
> >> 1. Why not build an R-Tree module in the form of a normal tree with two
> >> children and a key-value pair? (I'm not sure if this is allowed in
> Mahout,
> >> so veterans please chip in). Since an R-tree will always be an in-core
> >> entity.
> >>
> >> 2. If 1. is not done, then the method followed for the matrix
> >> representation of a BST should be followed. Only, the elements and
> >> conditions will be changed. On an abstract sense, Matrix representation
> of
> >> an R-Tree and matrix representation of a Binary Search Tree are
> analogous.
> >> But in this case, the construction and search costs for the Matrix
> version
> >> of an R-Tree will be costlier.
> >>
> >>
> >> *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if
> you
> >> could offer your insights.*
> >>
> >> Thanks,
> >> Aditya
> >>
> >>
> >> On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <trevor.d.grant@gmail.com
> >
> >> wrote:
> >>
> >> > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?
> >> (e.g.
> >> > 3d and 4d tensors)?
> >> >
> >> > I implemented these for the MLPs (still WIP)
> >> >
> >> > https://github.com/apache/mahout/pull/323/files#diff-
> >> > cd8a7c5e2cf42b91b5aa47c96daf19c0R25
> >> >
> >> > But those functions were specifically to overcome the challenges you
> >> > describe wrt neural network weight sets.
> >> >
> >> > If you could use those as is- that would be awesome, if not maybe
> you'll
> >> > find inspiration there.
> >> >
> >> >
> >> >
> >> > On Thu, Jun 22, 2017 at 6:43 PM, Aditya <ad...@gmail.com>
> >> wrote:
> >> >
> >> > > Hello everyone,
> >> > >
> >> > > I've been working for the past few weeks on coding an in-core DBSCAN
> >> > > algorithm.
> >> > >
> >> > > A more efficient version with an O(n*log(n)) complexity does exist
> >> but it
> >> > > uses the R-Tree data structure to index the data. I have a few
> >> > > concerns/questions and I'm hoping you would be able to help me out.
> >> > >
> >> > > 1. Based on my knowledge, using an indexing data structure like an
> >> R-Tree
> >> > > or a Kd-Tree is the only way to reduce the complexity of the dbscan
> >> > > algorithm. If there's any other method that you are familiar with,
> >> please
> >> > > let me know.
> >> > >
> >> > > 2. From what I've read in the book Apache Mahout: Beyond MapReduce
> >> > written
> >> > > by Andrew and Dmitry, I don't see how I can directly exploit the
> >> existing
> >> > > data structures and operations to get the functionality of an
> R-Tree.
> >> > >
> >> > > 3. On the off chance that an R-Tree module has to built in Mahout,
> >> > because
> >> > > it is not possible to easily use existing operations I need some
> >> insights
> >> > > as to how to go about it. I learned that everything in Mahout at the
> >> end
> >> > > should be serializable to a vector. I can't fathom how to do that
> >> > > intuitively in the case of an R-Tree
> >> > >
> >> > > There are a couple of other concerns that need to be discussed but
> >> these
> >> > > are vital at the moment.
> >> > >
> >> > > PS: The research paper that proposed the DBSCAN algorithm can be
> found
> >> > here
> >> > > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
> >> > >
> >> > > Thanks!
> >> > >
> >> > > -Aditya
> >> > >
> >> >
> >>
> >
> >
>

Re: Density based Clustering in Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

PS i read a few papers, including i believe that of Google's, on
partitioning of the DBScan problem for parallelization. It did not fit my
purposes though as they inherently assumed that every cluster problems had
enough centroids to figure to be efficiently partitioned. In effect it
amounted to similar effects as the probabilistic sketches techniques
mentioned in the book, but with much more headache for the bang. I
eventually turned to solving problems pre-sketched one way or another
(including for density clustering problems).

On Wed, Jul 5, 2017 at 8:59 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> (1) I abandoned any attempts at DBScan and implemented another density
> algorithm itself (can't say which, subject to patent restrictions). The
> reason being, i couldn't immediately figure how to parallelize it
> efficiently (aside from data structure discussions), the base algorithm is
> inherently iterative.
>
> (2) Samsara provides R-base level algebra, not general data structure
> support. IMO it would not pay to adjust it to do that any more than trying
> to fit R base to do it. I did implement spatial structures standardized on
> using Samsara in terms of carrying out computations (mostly in-memory), but
> those were still data structures in their on right.
>
> (3) like i said before, experience tells me that using collection of 2d
> tensors (esp. sparse tensors) is fine instead of trying to introduce n-d
> tensor. The fundamental difference here is mostly with sparse operations
> where n-d sparse tensor could intelligently execute those. But this is not
> supported properly pretty much by any library i know, so all the difference
> in most cases that say they support it directly is just putting a tensor
> api over collection of tensors. Practicality of having dense n-d tensors is
> also a bit questioned, since it immediately increases requirements for
> processing memory quantity of a single tensor instance, whereas collection
> of tensors could be represented as retaining streaming properties. etc.
> etc. Overall it sounds to me like a solution in a search of a problem
> (given my admittedly very limited experience as a practitioner in math).
>
>
> On Wed, Jul 5, 2017 at 12:09 AM, Aditya <ad...@gmail.com> wrote:
>
>> ***Important** **Do read** *
>> Hello everyone,
>>
>> Trevor and I have been discussing as to how to effectively represent an
>> R-Tree in Mahout. Turns out there is a method to represent a Binary Search
>> Tree (BST) in the form of an ancestor matrix. This
>> <http://www.geeksforgeeks.org/construct-ancestor-matrix-from
>> -a-given-binary-tree/>
>> and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-m
>> atrix/>
>> show the conversion logic from a tree representation to a matrix
>> representation and vice versa.
>>
>> In a distributed scenario, I know of the following design
>> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJc
>> ci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>
>> which is fairly efficient and intuitive to understand.
>>
>> Now the point for debate is this:
>> **Please read the design provided in the link above before proceeding**
>> The R-Tree will always be a local entity, no where in the algorithm is
>> there a need / necessity to have a distributed R-Tree kind of scenario. On
>> the other hand, the data points as well as the union find data structure
>> need to be stored in a DRM like data structure and they very well can be
>> represented in the form of a matrix. (A union find data structure
>> basically
>> can be implemented using a vector)
>>
>> 1. Why not build an R-Tree module in the form of a normal tree with two
>> children and a key-value pair? (I'm not sure if this is allowed in Mahout,
>> so veterans please chip in). Since an R-tree will always be an in-core
>> entity.
>>
>> 2. If 1. is not done, then the method followed for the matrix
>> representation of a BST should be followed. Only, the elements and
>> conditions will be changed. On an abstract sense, Matrix representation of
>> an R-Tree and matrix representation of a Binary Search Tree are analogous.
>> But in this case, the construction and search costs for the Matrix version
>> of an R-Tree will be costlier.
>>
>>
>> *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
>> could offer your insights.*
>>
>> Thanks,
>> Aditya
>>
>>
>> On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <tr...@gmail.com>
>> wrote:
>>
>> > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?
>> (e.g.
>> > 3d and 4d tensors)?
>> >
>> > I implemented these for the MLPs (still WIP)
>> >
>> > https://github.com/apache/mahout/pull/323/files#diff-
>> > cd8a7c5e2cf42b91b5aa47c96daf19c0R25
>> >
>> > But those functions were specifically to overcome the challenges you
>> > describe wrt neural network weight sets.
>> >
>> > If you could use those as is- that would be awesome, if not maybe you'll
>> > find inspiration there.
>> >
>> >
>> >
>> > On Thu, Jun 22, 2017 at 6:43 PM, Aditya <ad...@gmail.com>
>> wrote:
>> >
>> > > Hello everyone,
>> > >
>> > > I've been working for the past few weeks on coding an in-core DBSCAN
>> > > algorithm.
>> > >
>> > > A more efficient version with an O(n*log(n)) complexity does exist
>> but it
>> > > uses the R-Tree data structure to index the data. I have a few
>> > > concerns/questions and I'm hoping you would be able to help me out.
>> > >
>> > > 1. Based on my knowledge, using an indexing data structure like an
>> R-Tree
>> > > or a Kd-Tree is the only way to reduce the complexity of the dbscan
>> > > algorithm. If there's any other method that you are familiar with,
>> please
>> > > let me know.
>> > >
>> > > 2. From what I've read in the book Apache Mahout: Beyond MapReduce
>> > written
>> > > by Andrew and Dmitry, I don't see how I can directly exploit the
>> existing
>> > > data structures and operations to get the functionality of an R-Tree.
>> > >
>> > > 3. On the off chance that an R-Tree module has to built in Mahout,
>> > because
>> > > it is not possible to easily use existing operations I need some
>> insights
>> > > as to how to go about it. I learned that everything in Mahout at the
>> end
>> > > should be serializable to a vector. I can't fathom how to do that
>> > > intuitively in the case of an R-Tree
>> > >
>> > > There are a couple of other concerns that need to be discussed but
>> these
>> > > are vital at the moment.
>> > >
>> > > PS: The research paper that proposed the DBSCAN algorithm can be found
>> > here
>> > > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
>> > >
>> > > Thanks!
>> > >
>> > > -Aditya
>> > >
>> >
>>
>
>

Re: Density based Clustering in Mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

(1) I abandoned any attempts at DBScan and implemented another density
algorithm itself (can't say which, subject to patent restrictions). The
reason being, i couldn't immediately figure how to parallelize it
efficiently (aside from data structure discussions), the base algorithm is
inherently iterative.

(2) Samsara provides R-base level algebra, not general data structure
support. IMO it would not pay to adjust it to do that any more than trying
to fit R base to do it. I did implement spatial structures standardized on
using Samsara in terms of carrying out computations (mostly in-memory), but
those were still data structures in their on right.

(3) like i said before, experience tells me that using collection of 2d
tensors (esp. sparse tensors) is fine instead of trying to introduce n-d
tensor. The fundamental difference here is mostly with sparse operations
where n-d sparse tensor could intelligently execute those. But this is not
supported properly pretty much by any library i know, so all the difference
in most cases that say they support it directly is just putting a tensor
api over collection of tensors. Practicality of having dense n-d tensors is
also a bit questioned, since it immediately increases requirements for
processing memory quantity of a single tensor instance, whereas collection
of tensors could be represented as retaining streaming properties. etc.
etc. Overall it sounds to me like a solution in a search of a problem
(given my admittedly very limited experience as a practitioner in math).


On Wed, Jul 5, 2017 at 12:09 AM, Aditya <ad...@gmail.com> wrote:

> ***Important** **Do read** *
> Hello everyone,
>
> Trevor and I have been discussing as to how to effectively represent an
> R-Tree in Mahout. Turns out there is a method to represent a Binary Search
> Tree (BST) in the form of an ancestor matrix. This
> <http://www.geeksforgeeks.org/construct-ancestor-matrix-
> from-a-given-binary-tree/>
> and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-
> matrix/>
> show the conversion logic from a tree representation to a matrix
> representation and vice versa.
>
> In a distributed scenario, I know of the following design
> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4
> PyXLoNlVF0Yno8/edit?usp=sharing>
> which is fairly efficient and intuitive to understand.
>
> Now the point for debate is this:
> **Please read the design provided in the link above before proceeding**
> The R-Tree will always be a local entity, no where in the algorithm is
> there a need / necessity to have a distributed R-Tree kind of scenario. On
> the other hand, the data points as well as the union find data structure
> need to be stored in a DRM like data structure and they very well can be
> represented in the form of a matrix. (A union find data structure basically
> can be implemented using a vector)
>
> 1. Why not build an R-Tree module in the form of a normal tree with two
> children and a key-value pair? (I'm not sure if this is allowed in Mahout,
> so veterans please chip in). Since an R-tree will always be an in-core
> entity.
>
> 2. If 1. is not done, then the method followed for the matrix
> representation of a BST should be followed. Only, the elements and
> conditions will be changed. On an abstract sense, Matrix representation of
> an R-Tree and matrix representation of a Binary Search Tree are analogous.
> But in this case, the construction and search costs for the Matrix version
> of an R-Tree will be costlier.
>
>
> *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
> could offer your insights.*
>
> Thanks,
> Aditya
>
>
> On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <tr...@gmail.com>
> wrote:
>
> > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?
> (e.g.
> > 3d and 4d tensors)?
> >
> > I implemented these for the MLPs (still WIP)
> >
> > https://github.com/apache/mahout/pull/323/files#diff-
> > cd8a7c5e2cf42b91b5aa47c96daf19c0R25
> >
> > But those functions were specifically to overcome the challenges you
> > describe wrt neural network weight sets.
> >
> > If you could use those as is- that would be awesome, if not maybe you'll
> > find inspiration there.
> >
> >
> >
> > On Thu, Jun 22, 2017 at 6:43 PM, Aditya <ad...@gmail.com>
> wrote:
> >
> > > Hello everyone,
> > >
> > > I've been working for the past few weeks on coding an in-core DBSCAN
> > > algorithm.
> > >
> > > A more efficient version with an O(n*log(n)) complexity does exist but
> it
> > > uses the R-Tree data structure to index the data. I have a few
> > > concerns/questions and I'm hoping you would be able to help me out.
> > >
> > > 1. Based on my knowledge, using an indexing data structure like an
> R-Tree
> > > or a Kd-Tree is the only way to reduce the complexity of the dbscan
> > > algorithm. If there's any other method that you are familiar with,
> please
> > > let me know.
> > >
> > > 2. From what I've read in the book Apache Mahout: Beyond MapReduce
> > written
> > > by Andrew and Dmitry, I don't see how I can directly exploit the
> existing
> > > data structures and operations to get the functionality of an R-Tree.
> > >
> > > 3. On the off chance that an R-Tree module has to built in Mahout,
> > because
> > > it is not possible to easily use existing operations I need some
> insights
> > > as to how to go about it. I learned that everything in Mahout at the
> end
> > > should be serializable to a vector. I can't fathom how to do that
> > > intuitively in the case of an R-Tree
> > >
> > > There are a couple of other concerns that need to be discussed but
> these
> > > are vital at the moment.
> > >
> > > PS: The research paper that proposed the DBSCAN algorithm can be found
> > here
> > > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
> > >
> > > Thanks!
> > >
> > > -Aditya
> > >
> >
>

Re: Density based Clustering in Mahout

Posted by Aditya <ad...@gmail.com>.

***Important** **Do read** *
Hello everyone,

Trevor and I have been discussing as to how to effectively represent an
R-Tree in Mahout. Turns out there is a method to represent a Binary Search
Tree (BST) in the form of an ancestor matrix. This
<http://www.geeksforgeeks.org/construct-ancestor-matrix-from-a-given-binary-tree/>
and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-matrix/>
show the conversion logic from a tree representation to a matrix
representation and vice versa.

In a distributed scenario, I know of the following design
<https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>
which is fairly efficient and intuitive to understand.

Now the point for debate is this:
**Please read the design provided in the link above before proceeding**
The R-Tree will always be a local entity, no where in the algorithm is
there a need / necessity to have a distributed R-Tree kind of scenario. On
the other hand, the data points as well as the union find data structure
need to be stored in a DRM like data structure and they very well can be
represented in the form of a matrix. (A union find data structure basically
can be implemented using a vector)

1. Why not build an R-Tree module in the form of a normal tree with two
children and a key-value pair? (I'm not sure if this is allowed in Mahout,
so veterans please chip in). Since an R-tree will always be an in-core
entity.

2. If 1. is not done, then the method followed for the matrix
representation of a BST should be followed. Only, the elements and
conditions will be changed. On an abstract sense, Matrix representation of
an R-Tree and matrix representation of a Binary Search Tree are analogous.
But in this case, the construction and search costs for the Matrix version
of an R-Tree will be costlier.

*PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
could offer your insights.*

Thanks,
Aditya

On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <tr...@gmail.com>
wrote:

> What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?  (e.g.
> 3d and 4d tensors)?
>
> I implemented these for the MLPs (still WIP)
>
> https://github.com/apache/mahout/pull/323/files#diff-
> cd8a7c5e2cf42b91b5aa47c96daf19c0R25
>
> But those functions were specifically to overcome the challenges you
> describe wrt neural network weight sets.
>
> If you could use those as is- that would be awesome, if not maybe you'll
> find inspiration there.
>
>
>
> On Thu, Jun 22, 2017 at 6:43 PM, Aditya <ad...@gmail.com> wrote:
>
> > Hello everyone,
> >
> > I've been working for the past few weeks on coding an in-core DBSCAN
> > algorithm.
> >
> > A more efficient version with an O(n*log(n)) complexity does exist but it
> > uses the R-Tree data structure to index the data. I have a few
> > concerns/questions and I'm hoping you would be able to help me out.
> >
> > 1. Based on my knowledge, using an indexing data structure like an R-Tree
> > or a Kd-Tree is the only way to reduce the complexity of the dbscan
> > algorithm. If there's any other method that you are familiar with, please
> > let me know.
> >
> > 2. From what I've read in the book Apache Mahout: Beyond MapReduce
> written
> > by Andrew and Dmitry, I don't see how I can directly exploit the existing
> > data structures and operations to get the functionality of an R-Tree.
> >
> > 3. On the off chance that an R-Tree module has to built in Mahout,
> because
> > it is not possible to easily use existing operations I need some insights
> > as to how to go about it. I learned that everything in Mahout at the end
> > should be serializable to a vector. I can't fathom how to do that
> > intuitively in the case of an R-Tree
> >
> > There are a couple of other concerns that need to be discussed but these
> > are vital at the moment.
> >
> > PS: The research paper that proposed the DBSCAN algorithm can be found
> here
> > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
> >
> > Thanks!
> >
> > -Aditya
> >
>

Re: Density based Clustering in Mahout

Posted by Trevor Grant <tr...@gmail.com>.

What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?  (e.g.
3d and 4d tensors)?

I implemented these for the MLPs (still WIP)

https://github.com/apache/mahout/pull/323/files#diff-cd8a7c5e2cf42b91b5aa47c96daf19c0R25

But those functions were specifically to overcome the challenges you
describe wrt neural network weight sets.

If you could use those as is- that would be awesome, if not maybe you'll
find inspiration there.



On Thu, Jun 22, 2017 at 6:43 PM, Aditya <ad...@gmail.com> wrote:

> Hello everyone,
>
> I've been working for the past few weeks on coding an in-core DBSCAN
> algorithm.
>
> A more efficient version with an O(n*log(n)) complexity does exist but it
> uses the R-Tree data structure to index the data. I have a few
> concerns/questions and I'm hoping you would be able to help me out.
>
> 1. Based on my knowledge, using an indexing data structure like an R-Tree
> or a Kd-Tree is the only way to reduce the complexity of the dbscan
> algorithm. If there's any other method that you are familiar with, please
> let me know.
>
> 2. From what I've read in the book Apache Mahout: Beyond MapReduce written
> by Andrew and Dmitry, I don't see how I can directly exploit the existing
> data structures and operations to get the functionality of an R-Tree.
>
> 3. On the off chance that an R-Tree module has to built in Mahout, because
> it is not possible to easily use existing operations I need some insights
> as to how to go about it. I learned that everything in Mahout at the end
> should be serializable to a vector. I can't fathom how to do that
> intuitively in the case of an R-Tree
>
> There are a couple of other concerns that need to be discussed but these
> are vital at the moment.
>
> PS: The research paper that proposed the DBSCAN algorithm can be found here
> <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
>
> Thanks!
>
> -Aditya
>