
Posted to user@mahout.apache.org by Pedro Oliveira <cp...@gmail.com> on 2010/05/05 14:26:38 UTC

Multi-relational data

Hi,

I have a simple question: does Mahout support, or plan to support,
multi-relational datasets?
That is, datasets where each instance can have a variable number of values
for an attribute, and where those values can be other instances?
The basic example is a social network, where each person has several
attributes, and some attributes, like "knows", can have several distinct
values, and these values are other persons.
These datasets are usually very sparse (there are many distinct attributes,
but each instance has values for only a few of them), and the relational
information is very relevant (in the social network example, the
acquaintances of our acquaintances are relevant).


Cheers,
Pedro
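
The data model described in the message above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the names
Instance, two_hop and the sample people are hypothetical and appear nowhere
in the thread, and nothing here is Mahout code. Each attribute maps to a set
of values, a value may be the name of another instance, and "acquaintances of
our acquaintances" becomes a two-step walk along the "knows" attribute.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Instance:
    """One entity; each attribute maps to a set of zero or more values."""
    name: str
    # Values may be literals ("Lisbon") or names of other instances ("bob").
    attributes: dict[str, set[str]] = field(default_factory=dict)


def two_hop(instances: dict[str, Instance], start: str, relation: str) -> set[str]:
    """Return the instances reached by following `relation` twice from `start`,
    excluding `start` itself and its direct neighbours."""
    direct = instances[start].attributes.get(relation, set())
    reached: set[str] = set()
    for neighbour in direct:
        if neighbour in instances:
            reached |= instances[neighbour].attributes.get(relation, set())
    return reached - direct - {start}


# Hypothetical sample data: sparse, because most attributes are absent
# from most instances.
people = {
    "alice": Instance("alice", {"knows": {"bob"}, "city": {"Lisbon"}}),
    "bob": Instance("bob", {"knows": {"alice", "carol", "dave"}}),
    "carol": Instance("carol", {"knows": {"bob"}, "speaks": {"pt", "en"}}),
    "dave": Instance("dave"),
}

print(sorted(two_hop(people, "alice", "knows")))  # ['carol', 'dave']
```

Storing only the attributes an instance actually has keeps the representation
small when there are many distinct attributes but few values per instance.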

Re: Multi-relational data

Posted by Pedro Oliveira <cp...@gmail.com>.

Thank you all for the pointers. This is a subject that I'll probably have to
explore in a few weeks, and your help is much appreciated. I'll keep in
touch if something interesting comes from this work.

Cheers,
Pedro



Re: Multi-relational data

Posted by Ted Dunning <te...@gmail.com>.

We already support sparse vectors and matrices.  That should be pretty much
all you need.

There is emerging support for SVM and on-line logistic regression.  Support
for very large scale SVD is a little less mature, but it would give you a
reasonable basis for clustering or categorization.
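
One possible reading of the sparse-vector suggestion above is to give every
distinct (attribute, value) pair its own column and to store only the
non-zero entries of each row. The sketch below does this in plain Python with
dicts; it does not use Mahout's vector classes, and the names build_index,
to_sparse and cosine, as well as the sample data, are assumptions made for
illustration.

```python
def build_index(instances):
    """Assign a column index to every distinct (attribute, value) pair."""
    index = {}
    for name in sorted(instances):
        for attribute in sorted(instances[name]):
            for value in sorted(instances[name][attribute]):
                index.setdefault((attribute, value), len(index))
    return index


def to_sparse(attributes, index):
    """Encode one instance as {column: 1.0}, storing only non-zero entries."""
    return {
        index[(attribute, value)]: 1.0
        for attribute, values in attributes.items()
        for value in values
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[col] for col, weight in u.items() if col in v)
    norm_u = sum(w * w for w in u.values()) ** 0.5
    norm_v = sum(w * w for w in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical sample data: attribute name -> set of values.
data = {
    "alice": {"knows": {"bob", "carol"}, "city": {"Lisbon"}},
    "bob": {"knows": {"carol"}, "city": {"Lisbon"}},
    "carol": {"speaks": {"pt"}},
}
index = build_index(data)
rows = {name: to_sparse(attrs, index) for name, attrs in data.items()}

print(len(index))                                    # 4 columns in total
print(round(cosine(rows["alice"], rows["bob"]), 4))  # 0.8165
```

In this encoding a multi-valued attribute such as "knows" simply sets several
columns in the same row, so the variable number of values per attribute needs
no special handling.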

On Wed, May 5, 2010 at 6:29 AM, Pedro Oliveira <cp...@gmail.com> wrote:

> From a quick look at the code, a straightforward solution would be to define
> a new type of Vector (it wouldn't be a vector in the mathematical sense,
> just a way to save relational information about an instance), and some
> DistanceMeasures to work with that vector. Then we could use distance-based
> techniques, such as canopy clustering and k-means.
> Are there any plans to implement more distance-based (or kernel-based)
> algorithms, such as SVMs and KNN?
>

Re: Multi-relational data

Posted by Jake Mannix <ja...@gmail.com>.

Hi Pedro,

  Representing data points as semi-structured entities (and then doing
clustering, classification, or content-based recommendation) is certainly
something on the horizon for Mahout. See MAHOUT-274
<https://issues.apache.org/jira/browse/MAHOUT-274> for the start of the
thinking around representing semi-structured entities, and search for
"content-based recommendations" in the list archives to see the discussions
about that. This work is at an early stage, however, and I would bet it
doesn't get much support until the end of the summer or into the autumn.

  SVM support is on the way: see MAHOUT-232
<https://issues.apache.org/jira/browse/MAHOUT-232>, MAHOUT-237
<https://issues.apache.org/jira/browse/MAHOUT-237> and MAHOUT-227
<https://issues.apache.org/jira/browse/MAHOUT-227>. It should be fairly
scalable by the end of the summer.

  -jake


Re: Multi-relational data

Posted by Pedro Oliveira <cp...@gmail.com>.

Hi,

On Wed, May 5, 2010 at 8:41 AM, Sean Owen <sr...@gmail.com> wrote:

> You might have to be more specific. Support for this in the context of
> what: recommendations, clustering?
>

Classification, clustering, and recommendation are the most important ones.


>
> You can probably fit such concepts into any framework with enough
> cleverness, so in that sense, as a general framework, sure I don't see
> why any algorithm couldn't eventually be applied to such data.
>
> This is a fairly specific kind of data model, so I am not sure it
> would be something explicitly supported in some special way.
>

I'm currently working on a system that implements several non-parametric
machine learning techniques for multi-relational data (K-Medoids, KNN,
etc.), and it works quite nicely with data that fits in memory. However,
I have some new, huge datasets, so I'll probably need some kind of
parallelization, and Mahout seems like a good solution. The main purpose of
my email was to see if anyone else out there is working on the same thing.
From a quick look at the code, a straightforward solution would be to define
a new type of Vector (it wouldn't be a vector in the mathematical sense,
just a way to save relational information about an instance), and some
DistanceMeasures to work with that vector. Then we could use distance-based
techniques, such as canopy clustering and k-means.
Are there any plans to implement more distance-based (or kernel-based)
algorithms, such as SVMs and KNN?

Cheers,
Pedro
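
The proposal above (a relational container plus a distance measure, fed to a
distance-based method such as canopy clustering) can be sketched as follows.
This is plain Python, not Mahout's Vector or DistanceMeasure interfaces. The
choice of mean Jaccard distance over set-valued attributes is an assumption
made for illustration, not something proposed in the thread, and the names
relational_distance and canopy, the thresholds and the sample data are all
hypothetical.

```python
def relational_distance(a, b):
    """Mean Jaccard distance over the attributes present in either instance.

    `a` and `b` map attribute names to sets of values; a missing attribute
    counts as an empty set.
    """
    attributes = set(a) | set(b)
    if not attributes:
        return 0.0
    total = 0.0
    for attribute in attributes:
        x, y = a.get(attribute, set()), b.get(attribute, set())
        union = x | y
        similarity = len(x & y) / len(union) if union else 1.0
        total += 1.0 - similarity
    return total / len(attributes)


def canopy(points, distance, t1, t2):
    """Canopy clustering with loose threshold t1 and tight threshold t2.

    Returns a list of (center_name, member_names) pairs. Canopies may overlap.
    """
    if not t1 > t2 > 0:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    remaining = list(points)  # insertion order, so the result is deterministic
    canopies = []
    while remaining:
        center = remaining[0]
        dists = {p: distance(points[center], points[p]) for p in remaining}
        members = [p for p in remaining if dists[p] < t1]
        canopies.append((center, members))
        # Points within t2 of the center can no longer start a new canopy.
        remaining = [p for p in remaining if dists[p] >= t2]
    return canopies


# Hypothetical sample data: attribute name -> set of values.
people = {
    "alice": {"knows": {"bob", "carol"}, "city": {"Lisbon"}},
    "bob": {"knows": {"alice", "carol"}, "city": {"Lisbon"}},
    "erin": {"knows": {"frank", "gil"}, "city": {"Porto"}},
    "frank": {"knows": {"erin", "gil"}, "city": {"Porto"}},
}

for center, members in canopy(people, relational_distance, t1=0.6, t2=0.4):
    print(center, members)
# alice ['alice', 'bob']
# erin ['erin', 'frank']
```

Because the clustering routine only ever calls distance(a, b), the instances
never need to be vectors in the mathematical sense, which is the point made in
the message.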




Re: Multi-relational data

Posted by Sean Owen <sr...@gmail.com>.

You might have to be more specific. Support for this in the context of
what: recommendations, clustering?

You can probably fit such concepts into any framework with enough
cleverness, so in that sense, as a general framework, sure I don't see
why any algorithm couldn't eventually be applied to such data.

This is a fairly specific kind of data model, so I am not sure it
would be something explicitly supported in some special way.
