LAMB

class flash.core.optimizers.LAMB(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, exclude_from_layer_adaptation=False, amsgrad=False)[source]

Extends the Adam optimizer in PyTorch to incorporate the LAMB algorithm from the paper Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.
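
For orientation, here is a sketch of the per-step update as described in the paper (not a verbatim transcription of this implementation). With bias-corrected Adam moments $\hat{m}_t$ and $\hat{v}_t$, the Adam direction is

$$r_t = \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

and each layer's parameters $w_t$ are updated with a layer-wise trust ratio:

$$w_{t+1} = w_t - \eta \, \frac{\phi(\lVert w_t \rVert)}{\lVert r_t + \lambda w_t \rVert}\,(r_t + \lambda w_t),$$

where $\eta$ is the learning rate, $\lambda$ the weight decay, and $\phi$ a scaling function (often the identity). The layer-wise trust ratio is what distinguishes LAMB from Adam; parameters excluded from layer adaptation skip it.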

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • exclude_from_layer_adaptation (bool, optional) – whether to exclude these parameters from LAMB layer adaptation, i.e. skip the layer-wise trust ratio for them (default: False)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

Example

>>> from torch import nn
>>> from flash.core.optimizers import LAMB
>>> model = nn.Linear(10, 1)
>>> optimizer = LAMB(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> # loss_fn(model(input), target).backward()
>>> optimizer.step()

Warning

Since the default weight decay for LAMB is 0., this optimizer does not tie a weight decay of 0. to exclusion from layer adaptation, as LARS does. Doing so would cause the optimizer to exclude all layers from layer adaptation.
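
If per-group behaviour is needed, a plausible pattern (a sketch, assuming exclude_from_layer_adaptation is stored in the optimizer defaults and can therefore be overridden per parameter group, as is standard for torch.optim optimizers) is to exclude only selected parameters, e.g. the bias, from layer adaptation:

    from torch import nn
    from flash.core.optimizers import LAMB

    model = nn.Linear(10, 1)
    optimizer = LAMB(
        [
            # Weights use the full LAMB update with layer adaptation.
            {"params": [model.weight]},
            # Bias skips layer adaptation (assumes per-group override is supported).
            {"params": [model.bias], "exclude_from_layer_adaptation": True},
        ],
        lr=0.1,
        weight_decay=0.01,
    )

Check the implementation before relying on this per-group override.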

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
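
A closure is only needed when the loss must be re-evaluated inside step(). A minimal sketch, reusing the hypothetical loss_fn, input, and target placeholders from the example above:

    # Hypothetical loss computation mirroring the commented line in the example;
    # the closure re-evaluates the model and returns the loss.
    def closure():
        optimizer.zero_grad()
        loss = loss_fn(model(input), target)
        loss.backward()
        return loss

    optimizer.step(closure)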
