For years, the promise of automated requirements classification—the critical task of categorising requirements as functional, non-functional, security-related, and so on—has been tantalisingly close, driven by advances in Natural Language Processing (NLP). The recent explosion of Large Language Models (LLMs) like GPT-5, Grok-4, and Claude-4 seemed to be the final piece of the puzzle, delivering state-of-the-art performance in text understanding and classification.
Yet, for the software industry, this reliance on colossal, proprietary models presents a host of intractable problems. The sheer computational cost of running trillion-parameter models is prohibitive for many organisations. More critically, the typical cloud-hosted, closed-source nature of these LLMs introduces significant data-sharing risks. Company requirements are often highly confidential assets, and sending them to an external, proprietary service is a non-starter for security-conscious firms. This dependency compromises privacy, security, and reproducibility, limiting the ability of researchers and practitioners to adapt models to their specific, sensitive needs.
Enter Small Language Models (SLMs). These open-source alternatives, typically ranging from 7 to 8 billion parameters (e.g., Llama-3-8B, Qwen2-7B, Falcon-7B), offer a compelling solution. They are lightweight, locally deployable on private machines or servers, and enable the secure, cost-effective processing of sensitive data. Furthermore, their local execution significantly reduces operational costs and offers substantial benefits in terms of energy consumption and customisation.
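As a concrete illustration of what "locally deployable" means in practice, the sketch below loads an open 7-8B checkpoint with the Hugging Face transformers library and classifies a single requirement on local hardware. The model identifier, prompt, and settings are illustrative assumptions, not the study's configuration.

```python
# Minimal sketch: run an open SLM fully on-premise with Hugging Face transformers.
# The checkpoint name and generation settings are illustrative assumptions,
# not the exact configuration used in the study.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # any locally available 7-8B checkpoint works
    device_map="auto",                            # use a local GPU if present, otherwise CPU
)

requirement = "The system shall encrypt all user data at rest."
prompt = (
    "Classify the following requirement as FUNCTIONAL or NON-FUNCTIONAL:\n"
    f"{requirement}\nAnswer:"
)

print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```

Because the model weights never leave the organisation's own hardware, no requirement text is sent to an external service.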
The central question, however, has remained: Can these minnows truly compete with the titans? Do the advantages of privacy and resource efficiency come at an unacceptable cost to accuracy?
Our recent preliminary study, detailed in the paper "Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification" [1], set out to answer this question through a systematic, head-to-head comparison. The results are a game-changer for the future of automated requirements engineering (RE).
To ensure a fair and rigorous comparison, we evaluated eight language models, spanning proprietary LLMs (including GPT-5, Grok-4, and Claude-4) and open-source SLMs (including Llama-3-8B, Qwen2-7B, and Falcon-7B).
The size difference is staggering: the LLMs are 100 to 300 times larger than their SLM counterparts.
All models were tested on three public, well-established requirements classification datasets: PROMISE [3], PROMISE Reclass [4], and SecReq [5].
To maximise performance and ensure a robust evaluation, we employed a sophisticated Chain-of-Thought (CoT) plus Few-Shot prompting strategy, which has been shown to be highly effective in prior work [6].
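To give a flavour of this strategy, here is a minimal sketch of how a CoT plus few-shot prompt for binary requirements classification might be assembled. The exemplars and reasoning text are hypothetical illustrations, not the exact prompt used in the paper [6].

```python
# Minimal sketch of a Chain-of-Thought + few-shot prompt for binary requirements
# classification. The exemplars and reasoning steps are hypothetical illustrations,
# not the prompt used in the study.
FEW_SHOT_EXAMPLES = [
    {
        "requirement": "The system shall allow users to reset their password via email.",
        "reasoning": "This describes a behaviour the system must perform, so it specifies functionality.",
        "label": "FUNCTIONAL",
    },
    {
        "requirement": "The application shall respond to search queries within 2 seconds.",
        "reasoning": "This constrains how well the system performs rather than what it does, so it is a quality attribute.",
        "label": "NON-FUNCTIONAL",
    },
]

def build_prompt(requirement: str) -> str:
    """Assemble a CoT + few-shot prompt for a single requirement."""
    parts = ["Classify each requirement as FUNCTIONAL or NON-FUNCTIONAL. Think step by step.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Requirement: {ex['requirement']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Label: {ex['label']}\n"
        )
    parts.append(f"Requirement: {requirement}\nReasoning:")
    return "\n".join(parts)

print(build_prompt("The system shall log all failed login attempts."))
```

The few-shot exemplars anchor the label space, while the "Reasoning:" slot invites the model to articulate its classification rationale before committing to a label.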
The descriptive statistics initially showed a slight edge for the LLMs, which achieved average F1 scores roughly 2% higher than the SLMs across all datasets. For instance, the top-performing LLM, Claude-4, consistently led with F1 scores of 0.81 (PROMISE), 0.80 (PROMISE Reclass), and 0.89 (SecReq). The best SLM, Llama-3-8B, was close behind, achieving scores of 0.76, 0.78, and 0.88 on the same tasks, respectively [2].
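For readers less familiar with the metric, F1 is the harmonic mean of precision and recall, computed from true positives (TP), false positives (FP), and false negatives (FN):

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]

Because F1 balances the two components, a model can trade a little precision for higher recall while keeping a similar F1, which matters for the recall-oriented strengths discussed below.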
However, a deeper dive into the statistical analysis revealed the true story. Using the Scheirer-Ray-Hare test, a non-parametric equivalent of two-way ANOVA, we tested the null hypothesis that model type (SLM vs. LLM) has no effect on performance.
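As a rough illustration of the procedure (not the authors' analysis code), the test ranks all F1 observations jointly, decomposes the sums of squares of those ranks by the two factors, and refers each resulting H statistic to a chi-square distribution. The sketch below assumes a balanced design and omits the correction for tied ranks.

```python
# Minimal sketch of the Scheirer-Ray-Hare test (non-parametric two-way ANOVA on
# ranks). Assumes a balanced design and omits the ties correction; this is an
# illustration of the procedure, not the authors' analysis code.
import pandas as pd
from scipy import stats

def scheirer_ray_hare(df: pd.DataFrame, dv: str, factor_a: str, factor_b: str):
    data = df.copy()
    data["r"] = stats.rankdata(data[dv])            # rank all observations jointly, ties averaged
    n, grand = len(data), data["r"].mean()

    def ss_between(groups):                         # sum of squares of group means around the grand mean
        return sum(len(g) * (g["r"].mean() - grand) ** 2 for _, g in groups)

    ss_a = ss_between(data.groupby(factor_a))
    ss_b = ss_between(data.groupby(factor_b))
    ss_ab = ss_between(data.groupby([factor_a, factor_b])) - ss_a - ss_b
    ms_total = ((data["r"] - grand) ** 2).sum() / (n - 1)   # total mean square of the ranks

    results = {}
    for name, ss, dof in [
        (factor_a, ss_a, data[factor_a].nunique() - 1),
        (factor_b, ss_b, data[factor_b].nunique() - 1),
        ("interaction", ss_ab,
         (data[factor_a].nunique() - 1) * (data[factor_b].nunique() - 1)),
    ]:
        h = ss / ms_total                            # H statistic, referred to a chi-square distribution
        results[name] = {"H": h, "df": dof, "p": stats.chi2.sf(h, dof)}
    return results

# Usage: a tidy table with one F1 score per model/dataset run, e.g.
# scores = pd.DataFrame({"model_type": [...], "dataset": [...], "f1": [...]})
# print(scheirer_ray_hare(scores, "f1", "model_type", "dataset"))
```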
The finding was definitive: the performance difference between SLMs and LLMs is NOT statistically significant (p = 0.296) [1].
This means that, despite being up to 300 times smaller, the SLMs are functionally equivalent to the LLMs for the task of requirements classification.
The competition was not just a tie: on specific, crucial metrics, most notably recall, the SLMs demonstrated specialised strengths.
These results suggest that SLMs are not just "good enough"; they can be the optimal choice when a specific performance trade-off (like high recall) is required.
Perhaps the most critical finding of the study was that dataset characteristics play a far more significant role in performance than model size.
The statistical analysis showed a highly significant main effect of the dataset on the F1 score (p < 0.001, with a large effect size \(\eta^2_H = 0.63\)) [1]. All models, both SLMs and LLMs, performed worst on PROMISE Reclass and best on SecReq. Critically, the interaction between Model Type and Dataset was not significant (p = 0.790) [1]. This confirms that all models are affected by the dataset complexity in a similar manner, further supporting the conclusion that model size is a secondary factor.
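For readers unfamiliar with the notation, \(\eta^2_H\) is a rank-based analogue of eta-squared derived from the H statistic. One widely used convention (an assumption here, since the paper's exact formula is not quoted in this post) is

\[
\eta^2_H = \frac{H - k + 1}{n - k},
\]

where \(k\) is the number of groups and \(n\) the total number of observations. Values above roughly 0.14 are conventionally read as a large effect, so 0.63 indicates that the choice of dataset accounts for most of the variation in F1.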
The conclusion is clear: SLMs are a valid, high-performance alternative to LLMs for requirements classification.
For companies, this research provides the evidence needed to pivot away from expensive, privacy-compromising cloud-based LLMs. A 2% marginal loss in F1 score is an acceptable trade-off when weighed against the immense advantages of local deployment: data privacy and security for confidential requirements, lower operational and energy costs, and full control over customisation and reproducibility.
This shift empowers organisations to maintain control over their most sensitive intellectual property while still leveraging the power of modern NLP for automated RE tasks.
While this study focused on binary classification, the future of SLMs in RE is bright, and our roadmap already includes several follow-up directions.
The era of blindly chasing the largest model is over. For requirements classification, the evidence suggests that the smart, secure, and sustainable choice is often the small one.
References

[1] Zadenoori, M.A., et al. "Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification" (2025), p. 1.
[2] Zadenoori, M.A., et al. "Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification" (2025), p. 3.
[3] Cleland-Huang, J., et al. "Automated classification of non-functional requirements" (2007).
[4] Dalpiaz, F., et al. "Requirements classification with interpretable machine learning and dependency parsing" (2019).
[5] Knauss, E., et al. "Supporting requirements engineers in recognizing security issues" (2011).
[6] Zadenoori, M.A., et al. "Automatic prompt engineering: The case of requirements classification" (2025).