LLMs on the Line: Data Determines Loss-To-Loss Scaling Laws

Prasanna Mayilvahanan*
MPI-IS, University of Tübingen, Tübingen AI Center
Thaddäus Wiedemer*
MPI-IS, University of Tübingen, Tübingen AI Center
Sayak Mallick
University of Tübingen, MPI-IS, Tübingen AI Center
Matthias Bethge
University of Tübingen, Tübingen AI Center
Wieland Brendel
MPI-IS, ELLIS Institute Tübingen, Tübingen AI Center

tl;dr: We find that loss-to-loss scaling laws depend on pretraining data and tokenizer, not model size, optimization, or even significant architectural differences.

News

Feb '25: The pre-print is now available on arXiv.

Abstract

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

Overview

We make three main observations (illustrated in the figure below):

  1. LLMs' loss-to-loss scaling consistently follows shifted power laws (a sketch of such a fit follows this list).

  2. Pretraining data and tokenizer are the most salient factors for these scaling laws.

  3. In contrast, architecture plays a minor role, while model size, context length, and optimizer settings have negligible impact on loss-to-loss scaling trends.
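To make the first observation concrete, here is a minimal sketch of fitting a shifted power law to (pretraining loss, downstream loss) pairs. The parametrization L_tgt = K · (L_src − E_src)^κ + E_tgt, the synthetic loss values, and the use of scipy's curve_fit are illustrative assumptions, not the exact procedure from the paper.

# Minimal sketch: fit a shifted power law to (source loss, target loss) pairs.
# The functional form, the synthetic data, and the initial guess are
# illustrative assumptions, not the paper's exact setup.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(l_src, K, kappa, E_src, E_tgt):
    """Target loss modeled as a shifted power law of the source loss."""
    return K * np.clip(l_src - E_src, 1e-8, None) ** kappa + E_tgt

# Hypothetical (pretraining loss, downstream loss) pairs from models of
# increasing scale trained on the same data with the same tokenizer.
l_train = np.array([3.20, 2.90, 2.70, 2.50, 2.35, 2.20])
l_down = np.array([4.10, 3.60, 3.30, 3.05, 2.90, 2.75])

params, _ = curve_fit(
    shifted_power_law, l_train, l_down,
    p0=[1.0, 1.0, 1.5, 1.0],  # rough initial guess for K, kappa, E_src, E_tgt
    maxfev=10000,
)
print("Fitted K, kappa, E_src, E_tgt:", params)

Plotting l_down against l_train together with the fitted curve reproduces the kind of "models on a line" picture the title alludes to.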

What did we do, and how?

Our study investigates how different design choices impact loss-to-loss scaling laws in LLMs. Using over 6,000 model configurations, we conduct controlled interventions by varying factors such as pretraining data, tokenizer, architecture, model size, and optimization settings. By analyzing the resulting changes in the loss-to-loss scaling laws, we identify which of these factors actually determine the scaling trend; a toy version of this comparison is sketched below.
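As a rough illustration of the comparison logic, the snippet below fits one loss-to-loss curve per intervention group and measures how far the fitted curves lie apart. The group names, loss values, and functional form are made up for illustration; the paper's actual experiments rely on many more runs and controlled sweeps.

# Minimal sketch of a controlled intervention: fit one loss-to-loss curve per
# group and check whether the fitted trends coincide.
# Group names, loss values, and the functional form are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(l_src, K, kappa, E_src, E_tgt):
    return K * np.clip(l_src - E_src, 1e-8, None) ** kappa + E_tgt

# Hypothetical (pretraining loss, downstream loss) pairs, grouped by the factor
# being intervened on (here: architecture, with everything else held fixed).
runs = {
    "llama": (np.array([3.20, 2.95, 2.75, 2.60, 2.45, 2.30]),
              np.array([4.00, 3.60, 3.35, 3.15, 3.00, 2.85])),
    "mamba": (np.array([3.25, 3.00, 2.80, 2.62, 2.48, 2.33]),
              np.array([4.05, 3.65, 3.38, 3.17, 3.02, 2.87])),
}

fits = {}
for name, (l_src, l_tgt) in runs.items():
    params, _ = curve_fit(shifted_power_law, l_src, l_tgt,
                          p0=[1.0, 1.0, 1.5, 1.0], maxfev=10000)
    fits[name] = params

# If the fitted curves are (near-)identical over the shared loss range, the
# intervened factor does not change the loss-to-loss scaling trend.
grid = np.linspace(2.35, 3.20, 50)
gap = np.max(np.abs(shifted_power_law(grid, *fits["llama"])
                    - shifted_power_law(grid, *fits["mamba"])))
print(f"Max vertical gap between fitted curves: {gap:.3f}")

In the paper's terminology, a small gap (as for these two toy groups) indicates that the intervened factor leaves the loss-to-loss scaling law unchanged, whereas a large gap, like the one observed when swapping the pretraining dataset, indicates that the factor shifts the law itself.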

What does this imply?

Traditionally, LLM development has placed heavy emphasis on architectural improvements and model-scaling strategies. However, our research shows that even drastically different architectures, such as Llama (a transformer-based model) and Mamba (a state-space model), exhibit nearly identical loss-to-loss scaling when trained on the same data. Changing the pretraining dataset, by contrast, shifts the scaling trend substantially, making dataset curation a far more influential lever for downstream performance. These insights suggest that practitioners should rethink their optimization priorities and invest in high-quality, well-curated pretraining datasets.

Acknowledgements & Funding

We would like to thank (in alphabetical order) Ameya Prabhu, Attila Juhos, Evgenia Rusak, Fanfei Li, Jack Brady, Thomas Klein, and Vishaal Udandarao for helpful discussions and feedback. We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting PM and TW. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philanthropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG.

BibTeX

If you find our study helpful, please cite our paper:

@misc{mayilvahanan2025llmsline,
  title={LLMs on the Line: Data Determines Loss-To-Loss Scaling Laws},
  author={Prasanna Mayilvahanan and Thaddäus Wiedemer and Sayak Mallick and Matthias Bethge and Wieland Brendel},
  year={2025},
  eprint={2502.12120},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2502.12120}
}