LLM Details Summary

Paper List

Summary Table

| Model | Time (YYMM) | Bias | Auxiliary loss | Logits scale |
| --- | --- | --- | --- | --- |
| PaLM | 2204 | No | $10^{-4} \cdot \log ^2 Z$ | $\frac{1}{\sqrt{d}}$ |
| Galactica | 2211 | No | | |

Bias

PaLM appears to be the first paper to state explicitly that no bias terms are used in the linear layers:

No Biases – No biases were used in any of the dense kernels or layer norms. We found this to result in increased training stability for large models.

Galactica follows PaLM's configuration and does not use bias terms either.
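
As a minimal PyTorch sketch of the "No Biases" setting quoted above (not the actual PaLM or Galactica code, and with an illustrative model width), dense kernels and layer norms are simply created with their bias terms disabled:

```python
import torch.nn as nn

d_model = 4096  # hypothetical model width, for illustration only

# Bias-free dense kernel and layer norm, matching the "No Biases" setting.
dense = nn.Linear(d_model, 4 * d_model, bias=False)  # no bias in the dense kernel
norm = nn.LayerNorm(d_model, bias=False)             # no bias in layer norm (flag requires PyTorch >= 2.1)
```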

Training Instability

PaLM reports that loss spikes appear as training progresses. The mitigation is to roll back to a checkpoint roughly 100 steps before the spike and then skip roughly 200 to 500 data batches around it. PaLM also ran ablation experiments showing that the spikes are not caused by the data alone, but by the combination of specific data batches with the model's parameter state at that point:

For the largest model, we observed spikes in the loss roughly 20 times during training, despite the fact that gradient clipping was enabled. These spikes occurred at highly irregular intervals, sometimes happening late into training, and were not observed when training the smaller models. Due to the cost of training the largest model, we were not able to determine a principled strategy to mitigate these spikes. Instead, we found that a simple strategy to effectively mitigate the issue: We re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200–500 data batches, which cover the batches that were seen before and during the spike. With this mitigation, the loss did not spike again at the same point. We do not believe that the spikes were caused by “bad data” per se, because we ran several ablation experiments where we took the batches of data that were surrounding the spike, and then trained on those same data batches starting from a different, earlier checkpoint. In these cases, we did not see a spike. This implies that spikes only occur due to the combination of specific data batches with a particular model parameter state. In the future, we plan to study more principled mitigation strategy for loss spikes in very large language models.
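
The mitigation above is a training-loop procedure rather than a model change. Below is a schematic sketch of the control flow only; `restore_checkpoint` and `train_step` are hypothetical stand-ins, since PaLM's training infrastructure is not public.

```python
# Schematic sketch of the rollback-and-skip mitigation quoted above.
SPIKE_STEP = 10_000   # step at which the loss spike was observed (example value)
ROLLBACK = 100        # restart roughly 100 steps before the spike
SKIP = 300            # skip roughly 200-500 data batches around the spike

def restore_checkpoint(step: int) -> dict:
    return {"step": step}  # stand-in: reload model/optimizer state saved at `step`

def train_step(state: dict, batch_index: int) -> None:
    state["step"] += 1     # stand-in: one forward/backward/update on batch `batch_index`

state = restore_checkpoint(SPIKE_STEP - ROLLBACK)
for batch_index in range(SPIKE_STEP - ROLLBACK, SPIKE_STEP + 1_000):
    if batch_index < SPIKE_STEP - ROLLBACK + SKIP:
        continue           # drop the batches seen before and during the spike
    train_step(state, batch_index)
```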

Auxiliary loss

PaLM uses an auxiliary loss of $10^{-4} \cdot \log ^2 Z$, where $Z$ is the softmax normalizer, and reports that this makes training more stable.
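
A minimal PyTorch sketch of this auxiliary loss (not PaLM's code; shapes and vocabulary size are illustrative) computes $\log Z$ per token with `logsumexp` and adds the penalty to the standard cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Auxiliary z-loss: coeff * log^2(Z), with Z = sum(exp(logits))."""
    log_z = torch.logsumexp(logits, dim=-1)  # log Z for each token position
    return coeff * (log_z ** 2).mean()

# Usage: add the z-loss to the standard cross-entropy loss.
logits = torch.randn(8, 32000)               # (batch * seq, vocab), example shapes
targets = torch.randint(0, 32000, (8,))
loss = F.cross_entropy(logits, targets) + z_loss(logits)
```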

Weight initialization

PaLM scales the pre-softmax output logits by $\frac{1}{\sqrt{d}}$, where $d$ is the embedding dimension.
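
A minimal PyTorch sketch of this scaling (not PaLM's code; it assumes a tied input/output embedding matrix and illustrative sizes) applies the $\frac{1}{\sqrt{d}}$ factor right before the softmax:

```python
import torch
import torch.nn as nn

class TiedOutputHead(nn.Module):
    """Output projection that reuses the input embedding matrix and scales
    the pre-softmax logits by 1/sqrt(d), with d the embedding dimension."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        logits = hidden @ self.embed.weight.T   # (..., d_model) -> (..., vocab_size)
        return logits / self.d_model ** 0.5     # scale by 1/sqrt(d)

head = TiedOutputHead(vocab_size=32000, d_model=4096)
logits = head(torch.randn(2, 16, 4096))         # example shapes
```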
