Nexus: Common Minima for Better Generalization
paper"Same Pretraining Loss, Better Downstream Generalization via Common Minima." Proposes that the geometric closeness of task-specific minima during pretraining is intrinsically linked to downstream generalization. Introduces Nexus, an optimizer that encourages alignment of minima by maximizing gradient similarity.
At 3B scale, Nexus matches baseline pretraining loss while reducing out-of-distribution loss by 0.012 and improving reasoning accuracy by up to 15% on GSM8K. The result challenges the assumption that pretraining loss alone indicates model quality. By Chen, Zhang, Li, Dong, Shen, and Zhu (ByteDance Seed + Tsinghua).