| | Dense LLaMA-3.1 8B | PuDDing (20% pruned) |
|---|---|---|
| Memory | 16 GB | 12.8 GB (↓ 3.2 GB) |
| Pre-fill time | 251 ms | 206 ms (1.22× faster) |
| Accuracy | 74.9% | 62.93% (−11.97 %p) |
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent: a block that is crucial for one task can be removed without degrading accuracy on another. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where the option set itself has been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models and achieves better on-task performance than static depth pruning baselines.
A key motivation behind PuDDing is the observation that the importance of transformer blocks in large language models is highly task-dependent. A block that is essential for one task may be redundant for another. For example, in our experiments with LLaMA-3.1 8B, replacing block 30 with block 29 led to a sharp accuracy drop on BoolQ (over 20 %p), while slightly improving performance on PIQA and WinoGrande. This suggests that different blocks capture task-specific knowledge.
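To make this kind of probing concrete, below is a minimal sketch of a block-swap experiment, assuming a HuggingFace LLaMA-style checkpoint and a hypothetical `evaluate_accuracy(model, tokenizer, task)` helper supplied by the user. Block indices here follow the 0-indexed `model.model.layers` list, which may differ from the numbering used elsewhere on this page.

```python
# Sketch: probe task-dependent block importance by replacing one decoder block
# with another and re-evaluating each task. Assumes a HuggingFace LLaMA-style
# model (model.model.layers is an nn.ModuleList) and a hypothetical
# evaluate_accuracy(model, tokenizer, task) helper provided by the user.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def accuracy_with_block_replaced(target_idx, source_idx, tasks):
    """Accuracy on each task when block `target_idx` is replaced by a copy of
    block `source_idx` (e.g. target 30, source 29 in the BoolQ example)."""
    layers = model.model.layers
    original = layers[target_idx]
    replacement = copy.deepcopy(layers[source_idx])
    replacement.self_attn.layer_idx = target_idx   # keep KV-cache bookkeeping consistent
    layers[target_idx] = replacement               # swap the block in
    try:
        return {t: evaluate_accuracy(model, tokenizer, t) for t in tasks}
    finally:
        layers[target_idx] = original              # restore the original model

print(accuracy_with_block_replaced(30, 29, ["boolq", "piqa", "winogrande"]))
```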
These findings highlight the limitation of static pruning and motivate our core idea: pruning should be dynamic and prompt-aware. Rather than relying on a fixed omission set, PuDDing adapts the model architecture to each input prompt, achieving both efficiency and task-specialized performance.
The framework works in two stages. First, we construct a small but effective pool of candidate omission sets by evaluating task-specific losses on calibration datasets. Second, we train a transformer-based router that maps each prompt to the most suitable omission set in this pool.
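As an illustration of the second stage, the sketch below trains a small prompt-to-omission-set classifier. It assumes stage one has already produced a candidate pool (`omission_sets`, placeholder values here) and calibration pairs labelled with the index of each prompt's lowest-loss candidate; the DistilBERT encoder is an illustrative stand-in for the lightweight transformer router, not the exact architecture used in the paper.

```python
# Sketch of the second stage: train a lightweight router that maps a prompt to
# the index of its best omission set. `omission_sets` and the calibration pairs
# are placeholders standing in for the data-driven outputs of stage one.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PromptRouter(nn.Module):
    def __init__(self, num_candidates, encoder_name="distilbert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_candidates)

    def forward(self, prompts):
        batch = self.tokenizer(prompts, padding=True, truncation=True,
                               return_tensors="pt").to(self.head.weight.device)
        hidden = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] token
        return self.head(hidden)                                # candidate logits

# Placeholder candidate pool: each entry is a tuple of block indices to omit.
omission_sets = [(20, 26, 27), (4, 20, 26), (20, 27, 29)]
router = PromptRouter(num_candidates=len(omission_sets))
optimizer = torch.optim.AdamW(router.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# Toy calibration pairs: (prompt, index of its lowest-loss omission set).
calibration = [("Is the sky blue? Answer yes or no.", 0),
               ("Which is easier to lift, a feather or a car?", 1)]
for epoch in range(3):
    for prompt, best_idx in calibration:
        logits = router([prompt])
        loss = loss_fn(logits, torch.tensor([best_idx]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```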
At inference time, given a prompt, the router selects an omission set that best preserves task performance. The model then loads only the retained blocks into high-speed memory, skipping the pruned ones. This design significantly reduces loading time and computation cost while maintaining high accuracy across a variety of tasks.
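A minimal sketch of this inference path is shown below, reusing the `router` and `omission_sets` from the previous snippet. The in-place layer surgery on a HuggingFace LLaMA model is just one simple way to realize "run only the retained blocks"; it is not necessarily how PuDDing handles weight loading in practice.

```python
# Sketch of prompt-routed inference: the router picks an omission set for the
# incoming prompt, the corresponding decoder blocks are dropped, and generation
# runs on the remaining depth. Reuses `router` and `omission_sets` from above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()
full_layers = list(model.model.layers)          # keep a handle on every block

@torch.no_grad()
def generate_with_routing(prompt, max_new_tokens=64):
    # 1) Route: pick the omission set predicted to hurt this prompt the least.
    choice = router([prompt]).argmax(dim=-1).item()
    omitted = set(omission_sets[choice])

    # 2) Prune: rebuild the layer list without the omitted blocks.
    kept = [layer for idx, layer in enumerate(full_layers) if idx not in omitted]
    model.model.layers = torch.nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)

    # 3) Generate with the shallower model.
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)

print(generate_with_routing("Is it safe to microwave metal? Answer yes or no."))
```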
We evaluated PuDDing on zero-shot commonsense reasoning tasks using the LLaMA-3.1 8B model and compared its performance with state-of-the-art compression methods. As shown in the results above, PuDDing consistently achieves the highest average accuracy across all tested sparsity levels. In particular, when pruning 7 blocks (over 20% sparsity), it outperforms the best baseline by nearly 3 percentage points.
In addition, PuDDing demonstrates strong generalization on more complex, unseen tasks such as OpenBookQA, MathQA, and MMLU. Even though these tasks were not seen during router training, PuDDing maintains superior performance compared to existing pruning baselines. For detailed results, please refer to Table 5 in our paper.
We analyze the pruning patterns of LLaMA-3.1 8B under 20% sparsity across zero-shot tasks. Some blocks (e.g., 20, 26, 27) are consistently pruned, while others (e.g., 1–3, 5–8) are almost always retained. Certain blocks show task-specific behavior—for instance, block 4 is often pruned in ARC tasks but retained in PIQA and WinoGrande. Similar trends are observed in OPT and Vicuna models (see Appendix C).
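One simple way to reproduce this kind of analysis, assuming the `router` and `omission_sets` from the earlier sketches and a hypothetical `task_prompts` dict mapping task names to lists of evaluation prompts, is to tally how often each block appears in the selected omission sets:

```python
# Sketch: per-task block omission frequencies from the router's decisions.
from collections import Counter
import torch

def omission_frequency(task_prompts, num_blocks=32):
    freq = {}
    for task, prompts in task_prompts.items():
        with torch.no_grad():
            choices = router(prompts).argmax(dim=-1).tolist()
        counts = Counter()
        for c in choices:
            counts.update(omission_sets[c])     # blocks omitted for this prompt
        freq[task] = {b: counts[b] / len(prompts) for b in range(num_blocks)}
    return freq
```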
The table above shows that PuDDing, at 21.88% sparsity, achieves a consistent 1.2× speedup over the dense LLaMA-3.1 8B model on both A100 and RTX 6000 Ada GPUs, across both the pre-fill and generation stages, while the routing overhead remains minimal (4–8 ms).
```bibtex
@inproceedings{wee2025prompt,
  title     = {Prompt-based Depth Pruning of Large Language Models},
  author    = {Wee, Juyun and Park, Minjae and Lee, Jaeho},
  booktitle = {International Conference on Machine Learning},
  year      = {2025}
}
```