Predictable Scale Part II: Farseer
A refined scaling law for predicting LLM training loss across scales. Introduces a loss-surface function L(N, D) over model size N and training tokens D that reduces extrapolation error by 433% relative to Chinchilla's law. The authors trained ~1,000 models using ~3M GPU hours and fully open-source the results.
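For context, the baseline Farseer improves on is a Chinchilla-style loss surface. The sketch below evaluates that baseline form L(N, D) = E + A/N^α + B/D^β with the published Chinchilla fit coefficients (Hoffmann et al., 2022); it is not Farseer's own parameterization, which the paper derives from its ~1,000 trained models.

```python
def chinchilla_loss(N: float, D: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted training loss for N parameters and D training tokens.

    Coefficients are the Chinchilla fit; Farseer replaces this rigid
    functional form with a more flexible loss surface.
    """
    return E + A / N**alpha + B / D**beta

# Loss falls as either model size or data grows, approaching the floor E.
print(chinchilla_loss(1e9, 20e9))     # ~1B params, ~20B tokens
print(chinchilla_loss(70e9, 1.4e12))  # Chinchilla-scale run
```

Extrapolation error in this setting means fitting the surface on small runs and checking its predictions at much larger (N, D); Farseer's claimed 433% improvement is measured on exactly that kind of extrapolation.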
Paper
arXiv: 2506.10972