Posted to dev@horn.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2015/12/04 11:30:25 UTC

[DISCUSS] CPU-only for large NN

Hi folks,

Instead of competing (on performance) with GPU-based deep learning
projects, I realized that we need to focus more on CPU clusters and
optimization.

For instance, we can provide an easy way to design large-scale neural
networks with a more intuitive programming interface. The same framework
could also be used to reduce model size with pruning techniques, so that
the model fits in GPU memory (and if a neural network becomes more and
more sparse through pruning, the CPU might gain an advantage over the GPU).
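
To make the pruning idea concrete, here is a minimal sketch of
magnitude-based pruning in plain Java (the class name and the threshold
are invented for illustration; this is not Horn's API). It zeroes
connections whose weights are close to zero and reports how many survive:

public final class MagnitudePruning {

  // Zero out weights whose magnitude is below the threshold and return
  // how many non-zero weights remain after pruning.
  public static int prune(float[][] weights, float threshold) {
    int remaining = 0;
    for (float[] row : weights) {
      for (int j = 0; j < row.length; j++) {
        if (Math.abs(row[j]) < threshold) {
          row[j] = 0.0f;      // pruned connection
        } else {
          remaining++;        // still stored in the sparse model
        }
      }
    }
    return remaining;
  }

  public static void main(String[] args) {
    float[][] layer = { { 0.8f, -0.01f, 0.3f }, { 0.002f, -0.9f, 0.05f } };
    int kept = prune(layer, 0.1f);
    System.out.println("weights kept: " + kept + " of 6");  // 3 of 6
  }
}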

WDYT?

-- 
Best Regards, Edward J. Yoon

Re: [DISCUSS] CPU-only for large NN

Posted by Zachary Jaffee <zi...@case.edu>.
I agree; trying to compete with the many GPU-oriented projects would be very
difficult. So for now, building out the CPU cluster makes more sense than the
alternatives.

-- 
Zach Jaffee
B.S. Computer Science
Case Western Reserve University Class of 2017
Operations Director | WRUW FM 91.1 Cleveland
Secretary | Recruitment Chair | Phi Kappa Theta Fraternity
(917) 881-0646
zjaffee.com
github.com/ZJaffee

RE: [DISCUSS] CPU-only for large NN

Posted by "Edward J. Yoon" <ed...@samsung.com>.
Good insights :-)

Basically, it's hard to expect a GPU to speed up a sparse NN, because GPUs are
optimized for dense matrix multiplication. To regularize a NN, there are
dropout and pruning techniques, as you know. So a CPU-based Apache Horn can be
more flexible and useful for both data and model parallelism than the
alternatives, I think.
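
As a rough illustration of why a heavily pruned layer stops playing to the
GPU's strengths, here is a minimal compressed-sparse-row (CSR) matrix-vector
product in plain Java (class and variable names are made up for this sketch).
The per-row, indirection-heavy loop is the kind of irregular workload a CPU
cluster handles comfortably, while a dense GEMM-style kernel would waste most
of its work on zeros:

public final class CsrMatVec {

  // y = A * x, where A is stored in CSR form: values/colIdx hold only the
  // non-zero weights, and rowPtr[i]..rowPtr[i+1] delimits row i's entries.
  public static double[] multiply(double[] values, int[] colIdx,
                                  int[] rowPtr, double[] x) {
    double[] y = new double[rowPtr.length - 1];
    for (int i = 0; i < y.length; i++) {
      double sum = 0.0;
      for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
        sum += values[k] * x[colIdx[k]];
      }
      y[i] = sum;
    }
    return y;
  }

  public static void main(String[] args) {
    // 2x3 matrix [[0.5, 0, 0.25], [0, -0.75, 0]] stored as CSR.
    double[] values = { 0.5, 0.25, -0.75 };
    int[] colIdx = { 0, 2, 1 };
    int[] rowPtr = { 0, 2, 3 };
    double[] y = multiply(values, colIdx, rowPtr, new double[] { 1, 1, 1 });
    System.out.println(y[0] + ", " + y[1]);  // prints 0.75, -0.75
  }
}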

However, if a high-performance, optimized system is required for real-world
applications, a GPU might be better. So what I (originally) meant is that we
need to focus more on a flexible and scalable CPU cluster, instead of
competing with GPU-oriented projects.

--
Best Regards, Edward J. Yoon

Re: [DISCUSS] CPU-only for large NN

Posted by Zachary Jaffee <zi...@case.edu>.
Isn't it the case that GPU computing is better for matrix multiplication and
other heavy mathematical workloads, meaning we would eventually want to
incorporate it in some way? That said, I do think a sparser neural network
would benefit from being trained on CPUs, since the computations are simpler
and the time it takes to send the data to the GPU might make things slower.
The paper you emailed out yesterday mentioned that sparse networks have lower
accuracy, so that is something else we may want to look at. On the other hand,
this paper (http://arxiv.org/pdf/1102.4240.pdf) discusses sparsity as a means
of giving an RNN more versatility, so that's an option if we want to look
into it.

I think it's becoming clearer that data parallelism would perform better on
CPUs, whereas model parallelism would work better on GPUs, which makes sense
given what I'm reading here (http://arxiv.org/pdf/1404.5997v2.pdf) about
compute vs. data-transfer bottlenecks. If we could figure out a way to
determine whether computation per weight or computation per neuron is higher
for any given subproblem, and switch quickly between running on the CPU and
the GPU based on that, we would be in a very good place. Namely, for the
various subnets within a massively large neural network, we would run the
sparser parts on CPUs using data-parallel techniques, and when we see a
denser subnet we would take advantage of the GPU using model-parallel
techniques.
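
A toy sketch of that kind of dispatch heuristic (plain Java; the enum, the
method names, and the 0.5 density threshold are all invented for illustration
and do not correspond to any existing Horn or GPU API). It estimates a layer's
weight density and routes dense sub-nets to a model-parallel GPU path and
sparse ones to a data-parallel CPU path:

public final class PlacementHeuristic {

  enum Backend { CPU_DATA_PARALLEL, GPU_MODEL_PARALLEL }

  // Fraction of non-zero weights in the layer.
  static double density(float[][] weights) {
    long nonZero = 0, total = 0;
    for (float[] row : weights) {
      total += row.length;
      for (float w : row) {
        if (w != 0.0f) {
          nonZero++;
        }
      }
    }
    return total == 0 ? 0.0 : (double) nonZero / total;
  }

  // Dense sub-nets go to the GPU (model parallel); sparse ones stay on the
  // CPU cluster (data parallel).
  static Backend place(float[][] weights, double denseThreshold) {
    return density(weights) >= denseThreshold
        ? Backend.GPU_MODEL_PARALLEL
        : Backend.CPU_DATA_PARALLEL;
  }

  public static void main(String[] args) {
    float[][] sparseLayer = { { 0.8f, 0f, 0f }, { 0f, 0f, 0.1f } };
    System.out.println(place(sparseLayer, 0.5));  // CPU_DATA_PARALLEL
  }
}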

I also think this maps well onto a biological system: the human brain is
highly sparse, but certain areas of the brain are denser and handle the more
computationally intense activities such as vision. As we have seen with image
detection and tagging, a toolkit that focuses on dense, parallel computation
makes the most sense at this micro scale. But for a more macro-scale problem,
i.e. building a single system that links vision with motor movements, I think
it's safe to say that speed is much more important.

I could also be misunderstanding how this works, so let me know whether what
I am saying makes sense.

-- 
Zach Jaffee
B.S. Computer Science
Case Western Reserve University Class of 2017
Operations Director | WRUW FM 91.1 Cleveland
Secretary | Recruitment Chair | Phi Kappa Theta Fraternity
(917) 881-0646
zjaffee.com
github.com/ZJaffee