Posted to user@hama.apache.org by Panos Mandros <ma...@gmail.com> on 2012/10/10 20:40:10 UTC

Is Apache Hama suitable for building a decision tree?

I have implemented Google's framework for building decision trees
(also known as PLANET) in Hadoop. It is supposed to scale well to very
large datasets, but it has a major problem: it only scales well when
the dataset has few attributes. A dataset with many attributes means
many MapReduce jobs, and therefore a large start-up cost for all of
those jobs. Google, however, runs it with many modifications to its
Hadoop-like platform rather than to the algorithm itself. PLANET
starts with a single node and, through successive MapReduce jobs, adds
more and more nodes until the tree is fully built.
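
For illustration, a minimal sketch of such a level-by-level driver
loop might look like the code below (TreeModel, NodeQueue,
ExpandNodesMapper and ExpandNodesReducer are hypothetical names used
only to show the control flow, not the actual implementation). It
shows where the repeated start-up cost comes from, since every
expansion step submits a fresh Hadoop job:

// Sketch of a PLANET-style driver: one MapReduce job per tree expansion step.
// TreeModel, NodeQueue, ExpandNodesMapper and ExpandNodesReducer are
// hypothetical classes used only to illustrate the control flow.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PlanetDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    TreeModel tree = TreeModel.withSingleRootNode();      // the tree starts as one node
    NodeQueue openNodes = NodeQueue.containing(tree.root());
    int level = 0;

    while (!openNodes.isEmpty()) {
      // Every iteration pays the full Hadoop job start-up cost again,
      // which adds up when many jobs are needed.
      Job job = new Job(conf, "planet-expand-level-" + level);
      job.setJarByClass(PlanetDriver.class);
      job.setMapperClass(ExpandNodesMapper.class);
      job.setReducerClass(ExpandNodesReducer.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      Path out = new Path(args[1], "level-" + level);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("expansion job failed at level " + level);
      }
      // Read the best splits chosen by the reducer, grow the tree,
      // and collect the new leaves that still need to be expanded.
      openNodes = tree.applySplitsAndCollectOpenNodes(out, conf);
      level++;
    }
  }
}

The question below is essentially whether this outer loop can run as
iterations inside a single long-running job instead of separate Hadoop jobs.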

I have read many times that Apache Hama is well suited to iterative
algorithms such as graph algorithms. Can you build a new graph with
Hama, or can you only take a graph as input and run computations on
it? Would it be easy to port my project to Hama? Thanks

Re: Is Apache Hama suitable for building a decision tree?

Posted by Thomas Jungblut <th...@gmail.com>.
So we talked about this and want to get it done within a few weeks, so
stay tuned. I will open a JIRA issue for that soon.

Re: Is Apache Hama suitable for building a decision tree?

Posted by Thomas Jungblut <th...@gmail.com>.
Yes, that is great. I will help you with that.

Re: Is Apache Hama suitable for building a decision tree?

Posted by Panos Mandros <ma...@gmail.com>.
Hey Thomas,
    implementing PLANET was part of my bachelor thesis. It works not
only for single-label learning but also for multi-label learning,
since that is one of the areas my professor is interested in. It works
fine, but there are still things to be done. One of them is porting it
to Hama. Another is finding a more efficient way to transfer data from
the mappers to the reducer, because right now the map output is really
big. If you want, we can cooperate on this.
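
One possible way to shrink that map output on the Hadoop side, sketched
here under assumptions (the key layout and the HistogramWritable type
are hypothetical, not the actual thesis code), is a combiner that
pre-aggregates the per-node split statistics before they cross the
network to the reducer:

// Hypothetical combiner: sums the partial class-count histograms emitted
// by the mappers, so fewer and larger records reach the reducer.
// HistogramWritable is an assumed Writable holding per-class counts.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SplitStatsCombiner
    extends Reducer<Text, HistogramWritable, Text, HistogramWritable> {

  @Override
  protected void reduce(Text splitKey, Iterable<HistogramWritable> partials,
                        Context context) throws IOException, InterruptedException {
    HistogramWritable merged = new HistogramWritable();
    for (HistogramWritable partial : partials) {
      merged.add(partial); // element-wise sum of the class counts
    }
    context.write(splitKey, merged);
  }
}

It would be registered in the driver with
job.setCombinerClass(SplitStatsCombiner.class); how much it helps depends
on how many distinct (node, attribute, threshold) keys each mapper emits.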

Re: Is Apache Hama suitable for building a decision tree?

Posted by Thomas Jungblut <th...@gmail.com>.
Hey Panos,

thanks for transferring this.

Here is the paper for the others:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/de//pubs/archive/36296.pdf

I wanted to do this myself, but there wasn't enough time :/
As I said on Stack Overflow, I think the graph package is the wrong
approach here; you can clearly translate the MapReduce algorithm to
BSP and make use of the faster iterations.
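
To make that concrete, here is a minimal sketch of what the translation
could look like with Hama's BSP API (class names and the elided
split-evaluation logic are hypothetical, and I am assuming the
org.apache.hama.bsp API of the current 0.5/0.6 line). The key difference
from the MapReduce version is that the per-level iteration becomes
supersteps inside one long-running job, so the job start-up cost is paid
only once and each peer can keep its data partition in memory:

// Sketch only: the tree-growing loop as supersteps of a single Hama BSP job.
// Each peer keeps its data partition in memory across supersteps, so one
// level of the tree costs one sync() instead of a whole new Hadoop job.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class DecisionTreeBSP
    extends BSP<LongWritable, Text, Text, Text, Text> {

  @Override
  public void bsp(BSPPeer<LongWritable, Text, Text, Text, Text> peer)
      throws IOException, SyncException, InterruptedException {
    String master = peer.getPeerName(0); // one peer plays the coordinator
    boolean treeFinished = false;

    while (!treeFinished) {
      // 1. Every peer computes split statistics for the open nodes on its
      //    local data and sends them to the master (details elided).
      peer.send(master, new Text(/* serialized local split statistics */ ""));
      peer.sync(); // one superstep boundary replaces one whole MapReduce job

      // 2. The master merges the statistics, picks the best splits and
      //    broadcasts the updated tree to everyone.
      if (peer.getPeerName().equals(master)) {
        Text msg;
        while ((msg = peer.getCurrentMessage()) != null) {
          // merge statistics, choose splits, grow the tree model ...
        }
        for (String other : peer.getAllPeerNames()) {
          peer.send(other, new Text(/* serialized updated tree */ ""));
        }
      }
      peer.sync();

      // 3. Everyone reads the updated tree and checks whether open leaves
      //    remain; the real termination test is elided here.
      treeFinished = true; // placeholder so the sketch terminates
    }
  }
}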

Do you already have the code in MapReduce? I can simply turn it into
BSP. I would like to support the creation of random forests as well,
by training a decision tree in every task and combining them later.
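
For the random-forest part, the combining step could be as simple as the
sketch below (again hypothetical names, and the local training itself is
only a placeholder): every task trains one tree on its own partition and
ships the serialized model to a single peer, which collects the forest.

// Sketch: one tree per task, collected by peer 0 into the forest.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class RandomForestBSP extends BSP<LongWritable, Text, Text, Text, Text> {

  @Override
  public void bsp(BSPPeer<LongWritable, Text, Text, Text, Text> peer)
      throws IOException, SyncException, InterruptedException {
    String master = peer.getPeerName(0);
    // Hypothetical helper: train a single tree on this peer's input split.
    String serializedTree = trainLocalTree(peer);
    peer.send(master, new Text(serializedTree));
    peer.sync();

    if (peer.getPeerName().equals(master)) {
      List<String> forest = new ArrayList<String>();
      Text msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        forest.add(msg.toString()); // one serialized tree per task
      }
      peer.write(new Text("forest"), new Text(forest.size() + " trees collected"));
    }
  }

  private String trainLocalTree(BSPPeer<LongWritable, Text, Text, Text, Text> peer) {
    return ""; // placeholder for the actual per-partition training
  }
}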

