🍮 PuDDing: Prompt-routed Dynamic Depth Pruning

Pohang University of Science and Technology (POSTECH), South Korea.
ICML 2025
[Figure: PuDDing framework overview]

PuDDing (Prompt-routed Dynamic Depth Pruning) reduces memory usage and accelerates inference of large language models by selectively removing Transformer blocks based on the input prompt using a lightweight pretrained router.

👀 Quick glance at gains

|               | Dense LLaMA-3.1-8B | PuDDing (20% pruned)    |
|---------------|--------------------|-------------------------|
| Memory        | 16 GB              | 12.8 GB (↓ 3.2 GB)      |
| Pre-fill time | 251 ms             | 206 ms (1.22× faster)   |
| Accuracy      | 74.9 %             | 62.93 % (−11.97 %p)     |
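(Assuming the 16 GB dense footprint is dominated by the transformer block weights, the memory row is roughly proportional scaling: 16 GB × 0.8 ≈ 12.8 GB at 20 % block sparsity.)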

🔍 Under the hood (30 s read)

  • Task‑aware pruning: Block importance changes with the question. Our router learns this mapping from a handful of calibration datasets.
  • Single‑shot routing: Decisions are made once per prompt (see the sketch after this list), so runtime overhead stays below 8 ms.
  • Plug‑and‑play: Works on LLaMA, Vicuna, OPT and friends — no retraining of the backbone required.
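
A minimal sketch of the single-shot routing flow described above. The `OmissionRouter` class, the hand-written candidate pool, and the pooled prompt embedding are all illustrative assumptions, not the project's released API:

```python
# Hedged sketch of single-shot prompt routing; names and values are illustrative.
import torch
import torch.nn as nn

class OmissionRouter(nn.Module):
    """Scores a fixed pool of candidate omission sets from a pooled prompt embedding."""

    def __init__(self, hidden_dim: int, num_candidates: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_candidates)

    def forward(self, prompt_embedding: torch.Tensor) -> int:
        # One forward pass per prompt -> index of the omission set to apply.
        return int(self.scorer(prompt_embedding).argmax(dim=-1))

# Candidate omission sets (block indices to drop), built offline; values are made up.
omission_pool = [
    {20, 23, 24, 25, 26, 27, 28},
    {4, 20, 23, 24, 25, 26, 27},
]

router = OmissionRouter(hidden_dim=4096, num_candidates=len(omission_pool))
prompt_embedding = torch.randn(4096)            # stand-in for the prompt's pooled features
blocks_to_skip = omission_pool[router(prompt_embedding)]  # decided once, reused for the whole request
```

Because the decision is made once per prompt, the only recurring routing cost is this single scoring pass, which is why the overhead stays in the millisecond range.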

Abstract

Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent -- a block that is crucial for one task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines.

Observation

[Figure: task-dependent block importance]

A key motivation behind PuDDing is the observation that the importance of transformer blocks in large language models is highly task-dependent. A block that is essential for one task may be redundant for another. For example, in our experiments with LLaMA 3.1-8B, replacing block 30 with block 29 led to a sharp accuracy drop on BoolQ (over 20%p), while slightly improving performance on PIQA and WinoGrande. This suggests that different blocks capture task-specific knowledge.
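A probe in this spirit can be written in a few lines. The `evaluate_fn` helper and its `skip_blocks` interface below are assumptions for illustration, not the paper's code, but they convey how per-task block importance can be measured:

```python
# Illustrative importance probe: importance of block i on a task is the
# accuracy lost when that block is skipped during evaluation.
def blockwise_importance(model, task_loader, num_blocks, evaluate_fn):
    """Return {block index: accuracy drop on this task when the block is skipped}."""
    dense_acc = evaluate_fn(model, task_loader, skip_blocks=set())
    return {
        i: dense_acc - evaluate_fn(model, task_loader, skip_blocks={i})
        for i in range(num_blocks)
    }
```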

These findings highlight the limitation of static pruning and motivate our core idea: pruning should be dynamic and prompt-aware. Rather than relying on a fixed omission set, PuDDing adapts the model architecture to each input prompt, achieving both efficiency and task-specialized performance.

Framework

[Figure: two-stage PuDDing framework]

The framework works in two stages. First, we construct a small but effective pool of candidate omission sets by evaluating task-specific losses on calibration datasets. Second, we train a transformer-based router that maps each prompt to the most suitable omission set in this pool.
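The sketch below mirrors these two stages under assumed helper interfaces: `loss_fn`, the calibration loaders, and the `OmissionRouter` from the earlier sketch are all illustrative stand-ins rather than the released implementation.

```python
# Stage 1: keep, for each calibration task, the candidate omission set with the lowest loss.
# Stage 2: train the router to predict each prompt's best omission-set index.
import torch.nn.functional as F

def build_omission_pool(model, calibration_tasks, candidate_sets, loss_fn):
    """One winning omission set per calibration task; duplicates collapse the pool."""
    pool = []
    for task_loader in calibration_tasks:
        losses = [loss_fn(model, task_loader, skip_blocks=s) for s in candidate_sets]
        best = candidate_sets[min(range(len(losses)), key=losses.__getitem__)]
        if best not in pool:
            pool.append(best)
    return pool

def train_router(router, optimizer, prompt_embeddings, best_set_indices, epochs=3):
    """Supervised routing: classify each calibration prompt into its best omission set."""
    for _ in range(epochs):
        logits = router.scorer(prompt_embeddings)          # (batch, pool size)
        loss = F.cross_entropy(logits, best_set_indices)   # long targets indexing the pool
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```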

At inference time, given a prompt, the router selects an omission set that best preserves task performance. The model then loads only the retained blocks into high-speed memory, skipping the pruned ones. This design significantly reduces loading time and computation cost while maintaining high accuracy across a variety of tasks.
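An illustrative inference path, assuming the decoder blocks can be treated as a plain list of modules that are moved to the GPU individually; the actual memory-loading logic in the paper's implementation may differ:

```python
# Hedged sketch: pruned blocks are never moved to high-speed memory or executed.
import torch

@torch.no_grad()
def prefill_with_omission(blocks, hidden_states, blocks_to_skip, device="cuda"):
    """Run the pre-fill pass through the retained transformer blocks only."""
    for idx, block in enumerate(blocks):
        if idx in blocks_to_skip:
            continue                      # omitted block: not loaded, not executed
        block.to(device)                  # only retained blocks occupy GPU memory
        hidden_states = block(hidden_states.to(device))
    return hidden_states
```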

Main Results

[Figure: main results on zero-shot commonsense reasoning]

We evaluated PuDDing on zero-shot commonsense reasoning tasks using the LLaMA-3.1 8B model and compared its performance with state-of-the-art compression methods. As shown in the results above, PuDDing consistently achieves the highest average accuracy across all tested sparsity levels. In particular, when pruning 7 blocks (over 20% sparsity), it outperforms the best baseline by nearly 3 percentage points.

In addition, PuDDing demonstrates strong generalization on more complex, unseen tasks such as OpenBookQA, MathQA, and MMLU. Even though these tasks were not seen during router training, PuDDing maintains superior performance compared to existing pruning baselines. For detailed results, please refer to Table 5 in our paper.

Pruning Behavior and Latency Overview

[Figure: block-wise pruning patterns at 20% sparsity]

We analyze the pruning patterns of LLaMA-3.1 8B under 20% sparsity across zero-shot tasks. Some blocks (e.g., 20, 26, 27) are consistently pruned, while others (e.g., 1–3, 5–8) are almost always retained. Certain blocks show task-specific behavior—for instance, block 4 is often pruned in ARC tasks but retained in PIQA and WinoGrande. Similar trends are observed in OPT and Vicuna models (see Appendix C).
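Given the per-prompt omission sets chosen by the router, the pruning frequencies behind such an analysis can be tallied directly. This is a small analysis sketch with an assumed data layout, not the paper's analysis script:

```python
# Count how often each block index appears in the router's chosen omission sets.
from collections import Counter

def pruning_frequency(chosen_omission_sets, num_blocks):
    """Return the fraction of prompts for which each block was pruned."""
    counts = Counter(idx for s in chosen_omission_sets for idx in s)
    return {i: counts[i] / len(chosen_omission_sets) for i in range(num_blocks)}
```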

[Table: latency on A100 and RTX 6000 Ada GPUs]

The table above shows that PuDDing, at 21.88% sparsity, achieves a consistent 1.2× speedup over the dense LLaMA-3.1 8B model on both A100 and RTX 6000 Ada GPUs, for both the pre-fill and generation stages, while the routing overhead remains minimal (4–8 ms).
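Latency numbers of this kind are typically collected with CUDA events. The harness below is our own rough sketch, with `fn` standing in for a pre-fill, generation, or routing call; it is not the benchmark script used in the paper:

```python
# Average latency in milliseconds of fn(*args), measured with CUDA events.
import torch

def time_cuda_ms(fn, *args, warmup=3, iters=20):
    for _ in range(warmup):       # discard warm-up iterations (kernel compilation, caches)
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```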

BibTeX

@inproceedings{wee2025prompt,
  title={Prompt-based Depth Pruning of Large Language Models},
  author={Wee, Juyun and Park, Minjae and Lee, Jaeho},
  booktitle={International Conference on Machine Learning},
  year={2025}
}