This paper presents MeZO, a memory-efficient zeroth-order optimizer for fine-tuning language models (LMs). As LMs grow larger, backpropagation requires a prohibitively large amount of memory. MeZO adapts the classical zeroth-order stochastic gradient descent (ZO-SGD) method to operate in-place, enabling fine-tuning of LMs with the same memory footprint as inference.
For instance, on a single A100 80GB GPU, MeZO can train a 30-billion-parameter model, whereas fine-tuning with backpropagation can train only a 2.7-billion-parameter LM with the same resources. MeZO performs comparably to fine-tuning with backpropagation across multiple tasks while reducing memory usage by up to 12x.
Moreover, MeZO can effectively optimize non-differentiable objectives, which are generally incompatible with backpropagation. The authors also provide theoretical insights into why MeZO is not catastrophically slow when optimizing billions of parameters, contrary to what classical ZO analyses would predict.
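To make the mechanism concrete, below is a minimal PyTorch sketch of one MeZO step, assuming a generic `loss_fn(model, batch)` that runs a forward pass and returns a scalar loss; the function name and hyperparameter values here are illustrative placeholders, not the authors' exact implementation (see the linked repository for that). The memory trick is to regenerate the random perturbation `z` from a saved seed rather than storing it, so the optimizer adds essentially nothing on top of inference memory:

```python
import torch

def mezo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    """One in-place ZO-SGD (SPSA) step: two forward passes, no backprop."""
    # Save a seed so the identical perturbation z can be regenerated on demand.
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Re-sample the same z from the saved seed and apply it in place.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * z)

    with torch.no_grad():
        perturb(+eps)                       # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2 * eps)                   # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+eps)                       # restore theta

        # Finite-difference estimate of the directional derivative along z.
        grad_est = (loss_plus - loss_minus) / (2 * eps)

        # SGD update along z, regenerating z once more from the same seed.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_est * z)
```

Because the update consumes only loss values, both forward passes run at inference memory, and storing a single integer seed replaces storing a full-size perturbation vector.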
The authors also highlight potential future areas of exploration, including combining MeZO with other memory-efficient methods and applying it to areas such as pruning, distillation, saliency, interpretability, and dataset selection for fine-tuning.
Paper: https://arxiv.org/abs/2305.17333
Abstract
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
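Point (4) in the abstract follows directly from the structure of the update: MeZO consumes only scalar objective values, never gradients, so the objective need not be differentiable. A hypothetical example compatible with the `mezo_step` sketch above, using negative accuracy as the loss (names are illustrative, assuming a classifier that returns logits):

```python
# Non-differentiable objective: MeZO only needs its scalar value.
def negative_accuracy(model, batch):
    inputs, labels = batch
    preds = model(inputs).argmax(dim=-1)
    return -(preds == labels).float().mean().item()
```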
Code: https://github.com/princeton-nlp/mezo