Posted to dev@mxnet.apache.org by Naveen Swamy <mn...@gmail.com> on 2020/05/01 05:14:39 UTC

Using AMP

Hello,
I am trying to use AMP on an RNN model; however, I am not seeing
higher throughput with AMP, and the loss seems to have stagnated. I am
wondering if I am missing something.

Also, has AMP been tested on any RNN models, and are there any
benchmarks? I would appreciate some input here.

I used the RNN model in [1] and followed the tutorial in [2]; the
outputs of the two runs are below.
----
Without AMP:
mxnet-lm$ python train.py --cuda --tied --nhid 1500 --emsize 1500 --epochs
60  --dropout 0.65 --model gru --batch_size 128

[Epoch 3 Batch 200/13] loss 6.47, ppl 648.24, throughput 675.94 samples/s
[Epoch 3 Batch 400/13] loss 6.30, ppl 543.20, throughput 679.51 samples/s
[Epoch 3] time cost 90.29s, valid loss 5.97, valid ppl 392.94
test loss 5.89, test ppl 361.69
[Epoch 4 Batch 200/13] loss 6.15, ppl 470.58, throughput 676.46 samples/s
[Epoch 4 Batch 400/13] loss 6.01, ppl 408.21, throughput 679.51 samples/s
[Epoch 4] time cost 90.27s, valid loss 5.69, valid ppl 296.89

test loss 5.63, test ppl 277.58
----
With AMP:

(gluonnlp) ubuntu@ip-172-30-0-140:~/mxnet-lm$ python train.py --cuda --tied
--nhid 1500 --emsize 1500 --epochs 60  --dropout 0.65 --model gru
--batch_size 128 --amp True
Namespace(amp=True, batch_size=128, bptt=35, clip=0.25, cuda=True,
dropout=0.65, emsize=1500, epochs=60, export_model=False, gcthreshold=0.5,
gctype='none', hybridize=False, log_interval=200, lr=20, model='gru',
nhid=1500, nlayers=2, save='model.params', static_alloc=False,
static_shape=False, tied=True)
using AMP
INFO:root:Using AMP
[Epoch 3 Batch 200/13] loss 10.43, ppl 34026.18, throughput 685.66 samples/s
[Epoch 3 Batch 400/13] loss 10.38, ppl 32150.51, throughput 688.99 samples/s
[Epoch 3] time cost 89.04s, valid loss 10.36, valid ppl 31650.83
test loss 10.36, test ppl 31626.99
INFO:root:AMP: increasing loss scale to 131072.000000
[Epoch 4 Batch 200/13] loss 10.42, ppl 33642.12, throughput 686.83 samples/s
[Epoch 4 Batch 400/13] loss 10.37, ppl 31839.51, throughput 689.55 samples/s
----

Changes made to the training loop after initializing AMP and the trainer:

with autograd.record():
    output, hidden = model(data, hidden)
    # L is a vector of length batch_size * bptt
    L = loss(output, target)
    L = L / (args.bptt * args.batch_size)
    with amp.scale_loss(L, trainer) as scaled_loss:
        mx.autograd.backward(scaled_loss)

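For reference, the AMP setup before this loop followed the tutorial [2];
a rough sketch, not the exact script (`model`, `context` and `args` stand
for the objects built in train.py):

from mxnet import gluon
from mxnet.contrib import amp

amp.init()  # called before building the network so operators get patched for float16
# ... build `model` and `context` as in train.py ...
model.initialize(ctx=context)
trainer = gluon.Trainer(model.collect_params(), 'sgd',
                        {'learning_rate': args.lr})
amp.init_trainer(trainer)  # attaches the dynamic loss scaler used by amp.scale_loss()
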
----
[1]
https://github.com/apache/incubator-mxnet/blob/master/example/gluon/word_language_model/train.py

[2]
https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/amp.html

Thanks, Naveen

Re: Using AMP

Posted by Naveen Swamy <mn...@gmail.com>.
Thanks Przemek, I appreciate your input. Let me apply the loss scale to
the gradient clipping and run the experiment again.

On Fri, May 1, 2020 at 11:20 AM Przemysław Trędak <pt...@apache.org>
wrote:

> Just realized I did not actually link to the issue I mentioned, it is
> https://github.com/apache/incubator-mxnet/issues/17507

Re: Using AMP

Posted by Przemysław Trędak <pt...@apache.org>.
Just realized I did not actually link to the issue I mentioned, it is https://github.com/apache/incubator-mxnet/issues/17507


Re: Using AMP

Posted by Przemysław Trędak <pt...@apache.org>.
Hi Naveen,

The problem that you see with the loss is due to the fact that the model clips the gradient, which in the case of AMP is scaled by the loss scale. For clipping to work you need to apply the same loss scale to the value you are using to clip the gradients. This is currently possible in two ways: either use the amp.unscale API to unscale the gradients before clipping, or (currently quite hackily, there is an open issue [1] to expose it properly) use trainer._amp_loss_scaler.loss_scale to multiply your intended global norm of the gradients.
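To make that concrete, here is a rough sketch of the two options in a Gluon training loop (illustrative only, not tested against your script; `model`, `loss`, `trainer`, `data`, `target`, `hidden`, `ctx` and `args` are the objects from train.py):

from mxnet import autograd, gluon
from mxnet.contrib import amp

with autograd.record():
    output, hidden = model(data, hidden)
    L = loss(output, target) / (args.bptt * args.batch_size)
    with amp.scale_loss(L, trainer) as scaled_loss:
        autograd.backward(scaled_loss)

grads = [p.grad(ctx) for p in model.collect_params().values()
         if p.grad_req != 'null']

# Option 1: unscale the gradients first, then clip against the original threshold.
amp.unscale(trainer)
gluon.utils.clip_global_norm(grads, args.clip)

# Option 2 (instead of option 1): leave the gradients scaled and scale the
# clipping threshold by the current loss scale:
# gluon.utils.clip_global_norm(grads, args.clip * trainer._amp_loss_scaler.loss_scale)

trainer.step(1)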

Gradient clipping with AMP is a common problem people run into, and it should be covered in the tutorial. I intend to update the tutorial with an example of this, together with other changes intended to bring AMP out of contrib.

Regarding performance - it is quite hard to say what the reason is without profiling the application; there could be multiple different bottlenecks here other than the actual computation on the GPU.
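If you want to dig deeper, the built-in MXNet profiler is a good starting point; a minimal sketch (not specific to your script, the filename is just an example):

from mxnet import nd, profiler

profiler.set_config(profile_all=True, aggregate_stats=True,
                    filename='gru_amp_profile.json')
profiler.set_state('run')
# ... run a few training batches here ...
nd.waitall()             # wait for asynchronous GPU work to finish
profiler.set_state('stop')
print(profiler.dumps())  # aggregated per-operator statistics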

Hope this helps :-)
Przemek
