In this work, we investigate how optimization, data distribution, the loss function, and model architecture in LM pre-training influence the emergence of attention sink.