LAMB¶
- class flash.core.optimizers.LAMB(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, exclude_from_layer_adaptation=False, amsgrad=False)[source]¶
Extends ADAM in pytorch to incorporate LAMB algorithm from the paper: Large batch optimization for deep learning: Training BERT in 76 minutes.
- Parameters
params¶ (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr¶ (float, optional) – learning rate (default: 0.001)
betas¶ (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps¶ (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
weight_decay¶ (float, optional) – weight decay (L2 penalty) (default: 0)
exclude_from_layer_adaptation¶ (bool, optional) – whether to exclude these parameters from LAMB layer adaptation, i.e. skip the layerwise trust-ratio scaling for them (default: False)
amsgrad¶ (bool, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
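To make the layer adaptation concrete, the sketch below implements a single LAMB update for one layer in plain Python: an Adam-style bias-corrected moment update, plus a layerwise trust ratio ‖w‖ / ‖update‖ that rescales the step. This is an illustrative simplification, not the flash implementation (`lamb_step` and its list-based parameters are hypothetical names for this sketch).

```python
import math

def lamb_step(w, g, m, v, t, lr=0.001, betas=(0.9, 0.999), eps=1e-6,
              weight_decay=0.0):
    """One simplified LAMB update for a single layer (hypothetical sketch)."""
    beta1, beta2 = betas
    # Adam-style exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction at step t.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Adam update direction, with weight decay folded into the update.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Layerwise trust ratio: ||w|| / ||update||, falling back to 1
    # when either norm is zero (this is what
    # exclude_from_layer_adaptation would skip).
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = [wi - lr * trust_ratio * ui for wi, ui in zip(w, update)]
    return w, m, v

# One step on a two-weight "layer" with positive gradients:
# both weights move in the negative gradient direction.
w, m, v = lamb_step([0.5, -0.3], [0.1, 0.2], [0.0, 0.0], [0.0, 0.0], t=1)
```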
Example
>>> from torch import nn
>>> model = nn.Linear(10, 1)
>>> optimizer = LAMB(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> # loss_fn(model(input), target).backward()
>>> optimizer.step()
Warning
Since the default weight decay for LAMB is 0., we do not tie a weight decay of 0. to exclusion from layer adaptation, as LARS does. Doing so would cause the optimizer to exclude all layers from layer adaptation under the default settings.