Posted to dev@hama.apache.org by 顾荣 <gu...@gmail.com> on 2012/09/20 14:39:28 UTC

Does Hama Graph provide any file reader interface at runtime?

Hi, guys.

As you are calling for application programs on Hama in the *Future
Plans* section of the Hama programming wiki (
https://issues.apache.org/jira/secure/attachment/12528218/ApacheHamaBSPProgrammingmodel.pdf),
and as I am very interested in machine learning, I plan to implement neural
networks (e.g. a Multilayer Perceptron trained with BP) on Hama. Hama seems
to be a nice tool for training large-scale neural networks. Especially for
networks with a large structure (many hidden layers and many neurons), I
find that Hama Graph offers a good solution. We can regard each neuron in a
NN (neural network) as a vertex in Hama Graph, and the links between
neurons as edges in the graph. Then the training process can be regarded as
updating the weights of the edges between vertices. However, I encountered
a problem with the current Hama Graph implementation.
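
To make the mapping concrete, here is a rough sketch of what such a neuron
vertex could look like during forward propagation. The Vertex/Edge method
names and signatures below are only my approximation of the Hama Graph API
(they differ between versions), so please read it as an illustration of the
idea rather than working code:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.graph.Edge;
import org.apache.hama.graph.Vertex;

// One neuron per vertex; the edge value is the connection weight.
// Each superstep: sum the weighted activations received from the previous
// layer, apply a sigmoid, and forward the own weighted activation onward.
public class NeuronVertex extends Vertex<Text, DoubleWritable, DoubleWritable> {
  @Override
  public void compute(Iterable<DoubleWritable> messages) throws IOException {
    double net = 0.0;
    for (DoubleWritable m : messages) {
      net += m.get();
    }
    double activation = 1.0 / (1.0 + Math.exp(-net)); // sigmoid
    setValue(new DoubleWritable(activation));
    for (Edge<Text, DoubleWritable> edge : getEdges()) {
      sendMessage(edge, new DoubleWritable(activation * edge.getValue().get()));
    }
  }
}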

Let me explain the problem. As you may know, during the training process
of many machine learning algorithms, we need to feed many training samples
into the model one by one. Usually, more training samples lead to more
precise models. However, as far as I know, the only input file interface
provided by Hama Graph is the input for the graph structure. Sadly, it is
hard to read and distribute the training samples at runtime, because users
can only implement their computing logic by overriding a few key functions
such as compute() in the Vertex class. So, does Hama Graph provide any
flexible file-reading interface for users at runtime?

Thanks in advance.

Walker.

Re: Does Hama Graph provide any file reader interface at runtime?

Posted by 顾荣 <gu...@gmail.com>.
Hi Thomas,

I have not uploaded cNeural to the web yet, because the code is tightly
coupled to the configuration of the Hadoop and HBase deployment on our lab
cluster. It was originally an experiment for scientific research purposes.
As you suggested, I will organize the code and upload it to the web. I am
now also planning to implement my algorithm in Hama BSP; it is a good fit.

I will post to this mail thread if there is more progress later :)

Nice to talk to you.

Walker

Re: Does Hama Graph provide any file reader interface at runtime?

Posted by Thomas Jungblut <th...@gmail.com>.
Hey Walker,

cool thing, can you share a link to your cNeural library?
Martin is working on GPU support for Hama in relation to Hama pipes (
https://issues.apache.org/jira/browse/HAMA-619).
He wants to go the native way, but personally I have had pretty good
experiences with JCUDA; I used it in my neural net implementation for
image recognition tasks with a large number of input neurons.
However, that only speeds up the matrix multiplications, which usually
take most of the time.
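
For reference, the computation I mean is the per-layer matrix-vector
product of the feed-forward pass; this plain-Java version is the part a
JCUDA/GPU kernel would replace (a simplified sketch, sigmoid activation
assumed, method name is just for illustration):

// Feed-forward of one layer: out = sigmoid(W * input).
// weights[i][j] is the weight from input neuron j to output neuron i.
static double[] forwardLayer(double[][] weights, double[] input) {
  double[] out = new double[weights.length];
  for (int i = 0; i < weights.length; i++) {
    double sum = 0.0;
    for (int j = 0; j < input.length; j++) {
      sum += weights[i][j] * input[j];
    }
    out[i] = 1.0 / (1.0 + Math.exp(-sum)); // sigmoid activation
  }
  return out;
}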

The HBase storage is really interesting, thanks for sharing!

Re: Does Hama Graph provide any file reader interface at runtime?

Posted by 顾荣 <gu...@gmail.com>.
Hi Thomas,

I read your blog and GitHub posts about training NNs on Hama several days
ago. I agree with you on this topic, based on my experience implementing
NNs in a distributed way.
That was before I knew about the Hama project, so I implemented a custom
distributed system, called cNeural, for training NNs with large-scale
training data myself.
It basically follows a master/slave architecture. I adopted Hadoop RPC for
communication and HBase for storing the large-scale training dataset, and I
used a batch-mode BP training algorithm.
BTW, HBase is very suitable for storing training data sets for machine
learning. No matter how large a training data set is, an HTable can easily
store it across many region servers.
Each training sample can be stored as a record in an HTable, even if it is
sparsely coded. Furthermore, HBase provides random access to your training
samples. In my experience, it is much better to store such structured data
in HBase than directly in HDFS.
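
For illustration, storing and fetching one sample with the HBase client API
looks roughly like this (the table name, row key and column names are
made-up examples, not from cNeural):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Store one training sample as a row, then read it back by row key.
static void storeAndFetchSample() throws IOException {
  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "training_samples"); // example table name

  Put put = new Put(Bytes.toBytes("sample-00001"));
  put.add(Bytes.toBytes("f"), Bytes.toBytes("features"), Bytes.toBytes("0.1,0.0,0.7"));
  put.add(Bytes.toBytes("f"), Bytes.toBytes("label"), Bytes.toBytes("1"));
  table.put(put);

  Get get = new Get(Bytes.toBytes("sample-00001")); // random access by row key
  Result result = table.get(get);
  byte[] features = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("features"));
  table.close();
}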

Back to the topic: as you mentioned, I can read training data directly from
HDFS through the HDFS API during the setup stage of the vertex.
I had also considered this and have known the HDFS API for a long time;
thanks for the hint anyway :)
However, I am afraid it may cost quite a lot of time, because in a
large-scale NN with thousands of neurons, having each neuron vertex read
the same training sample almost simultaneously would cause a lot of network
traffic and put too much stress on HDFS. What's more, it seems unnecessary.
I planned to select a master vertex responsible for reading samples from
HDFS, and to initialize each input neuron by sending it the corresponding
feature value.
However, even if I can do this, there are many more tough problems to
solve, such as partitioning. As you said, controlling this training
workflow in a distributed way is too complex. And with so much network
communication and distributed synchronization, it will be much slower than
a sequential program executed on a single machine. In a word, this kind of
distribution will probably lead to no improvement, only slower speed and
higher complexity. As for the high dimensionalities you mentioned, I
suggest using GPUs to handle them; distribution may not be a good solution
in that case. Of course, we can combine GPUs with Hama, and I believe that
will be necessary in the near future.

As I mentioned at the beginning of this mail, I implemented cNeural, and I
also compared cNeural with Hadoop for solving this problem. The experiment
results can be found in the attachment of this mail. In general, cNeural
adopts a parallel strategy similar to the BSP model, so I am about to
reimplement cNeural on Hama BSP. I learned Hama Graph this week, came
across the idea of implementing NNs on Hama Graph, thought about this case,
and asked this question. I agree with your analysis.

Regards,
Walker.


Re: Does Hama Graph provide any file reader interface at runtime?

Posted by Thomas Jungblut <th...@gmail.com>.
Hi,

nice idea, but I'm really unsure whether the graph module fits your needs.
In backprop you need to set the inputs on the different neurons in your
input layer, and you have to forward-propagate them until you reach the
output layer. Calculating the error for this single step in your
architecture would consume many supersteps. In my opinion this is totally
inefficient, but let's set that thought aside for now.

Assume you have an n-by-m matrix which contains your whole training set,
and the m-th column holds the outcome for the preceding features.
An input vertex should be able to read a row of the corresponding column
vector from the training set, and the output neurons need to do the same.
Good news: you can do this by reading a file within the setup function of a
vertex, or by reading it line by line when compute is called. You can
access filesystems with the Hadoop DFS API pretty easily. Just type it into
your favourite search engine; the class is called FileSystem and you can
get it via FileSystem.get(Configuration conf).
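
To spell that out, reading a training file from HDFS (e.g. inside the setup
function of a vertex) boils down to something like this; the method name
and the path are of course just placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Open a file on the configured default filesystem (HDFS on a cluster)
// and read it line by line, one training sample per line.
static void readTrainingFile(Configuration conf) throws IOException {
  FileSystem fs = FileSystem.get(conf);
  Path trainingFile = new Path("/data/training.csv"); // placeholder path
  BufferedReader reader =
      new BufferedReader(new InputStreamReader(fs.open(trainingFile)));
  String line;
  while ((line = reader.readLine()) != null) {
    // parse one sample, e.g. comma-separated features plus the outcome
  }
  reader.close();
}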

Now here is my experience with raw BSP and neural networks, if you weigh it
against the graph module:
- partition the neurons horizontally (across the layers), not by layer
- weights must be averaged across multiple tasks (see the sketch below)
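
Here is roughly what I mean by averaging the weights across tasks, using a
master task that collects, averages and broadcasts a value. For brevity it
averages a single scalar weight held in a hypothetical "localWeight" field,
and the BSPPeer method names are from memory of the Hama API, so
double-check them against your version:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Inside a BSP subclass: every peer sends its local weight to the master,
// the master averages the values and broadcasts the result back.
public void bsp(BSPPeer<?, ?, ?, ?, DoubleWritable> peer)
    throws IOException, SyncException, InterruptedException {
  String master = peer.getAllPeerNames()[0];
  peer.send(master, new DoubleWritable(localWeight));
  peer.sync(); // superstep barrier

  if (peer.getPeerName().equals(master)) {
    double sum = 0.0;
    int count = 0;
    DoubleWritable msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      sum += msg.get();
      count++;
    }
    DoubleWritable average = new DoubleWritable(sum / count);
    for (String peerName : peer.getAllPeerNames()) {
      peer.send(peerName, average); // broadcast the averaged weight
    }
  }
  peer.sync();
  localWeight = peer.getCurrentMessage().get(); // every peer adopts the average
}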

I came to the conclusion that it is considerably better to implement a
function optimizer with raw BSP to train the weights (a simple stochastic
gradient descent totally works out for almost every normal use case if your
network has a convex cost function).
Of course this doesn't work out well for higher dimensionalities, but more
data usually wins, even with simpler models. In the end you can always
boost it anyway.
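
And by "simple stochastic gradient descent" I mean nothing fancier than an
update step like this (plain Java, squared loss on a linear model, method
name just for illustration):

// One SGD step for a single training example: w := w - eta * gradient.
static void sgdStep(double[] w, double[] x, double target, double eta) {
  double prediction = 0.0;
  for (int j = 0; j < w.length; j++) {
    prediction += w[j] * x[j];
  }
  double error = prediction - target; // d(0.5 * error^2) / d(prediction)
  for (int j = 0; j < w.length; j++) {
    w[j] -= eta * error * x[j];       // gradient of the loss w.r.t. w[j]
  }
}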

I will of course support you on this if you like. I'm fairly certain that
your approach can work, but it will be slow as hell.
Just my usual two cents on various topics ;)
