In this work, we investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence the emergence of attention sink.
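To make the object of study concrete, the sketch below shows one way an attention sink could be quantified: the share of attention mass each head assigns to the first token, averaged over query positions. This is a minimal illustration, not the paper's metric; it assumes a Hugging Face `transformers` causal LM that can return attention weights, and the model name and averaging scheme are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sink measurement: fraction of attention mass that each
# head assigns to position 0, averaged over all query positions.
# "gpt2" is an arbitrary example model, not one studied here.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "Attention sinks concentrate attention mass on early tokens."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq)
# tensor per layer of post-softmax attention weights.
for layer_idx, attn in enumerate(outputs.attentions):
    # Mean attention on the first token, per head.
    sink_score = attn[0, :, :, 0].mean(dim=-1)  # shape: (heads,)
    print(f"layer {layer_idx:2d}: mean sink score {sink_score.mean():.3f}")
```

A head whose score stays far above the uniform baseline of 1/seq_len is directing a disproportionate share of its attention to the first token, which is the qualitative signature the phenomenon refers to.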