Does Model Size Matter?
Why Small Language Models Are the Future of Requirements Engineering

The Requirements Engineering (RE) community is at a crossroads.

For years, the promise of automated requirements classification—the critical task of categorising requirements as functional, non-functional, security-related, and so on—has been tantalisingly close, driven by advances in Natural Language Processing (NLP). The recent explosion of Large Language Models (LLMs) like GPT-5, Grok-4, and Claude-4 seemed to be the final piece of the puzzle, delivering state-of-the-art performance in text understanding and classification.

Yet, for the software industry, this reliance on colossal, proprietary models presents a host of intractable problems. The sheer computational cost of running trillion-parameter models is prohibitive for many organisations. More critically, the typical cloud-hosted, closed-source nature of these LLMs introduces significant data-sharing risks. Company requirements are often highly confidential assets, and sending them to an external, proprietary service is a non-starter for security-conscious firms. This dependency compromises privacy, security, and reproducibility, limiting the ability of researchers and practitioners to adapt models to their specific, sensitive needs.


The Rise of the Underdogs: Small Language Models (SLMs)

Enter the Small Language Models (SLMs). These open-source alternatives, typically ranging from 7 to 8 billion parameters (e.g., Llama-3-8B, Qwen2-7B, Falcon-7B), offer a compelling solution. They are lightweight, locally deployable on private machines or servers, and enable the secure, cost-effective processing of sensitive data. Furthermore, their local execution significantly reduces operational costs and offers substantial benefits in terms of energy consumption and customisation.
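To make "locally deployable" concrete, here is a minimal sketch of what on-premises classification with an SLM can look like using the Hugging Face transformers library. The model name, prompt, and hardware assumptions (a single GPU or generous RAM) are illustrative, not the study's exact setup.

    # Minimal sketch: classifying one requirement with a locally hosted SLM.
    # Assumes the transformers library is installed and the model weights fit in local memory.
    from transformers import pipeline

    slm = pipeline(
        "text-generation",
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # downloaded once, then runs fully on-premises
        device_map="auto",
    )

    messages = [
        {"role": "system", "content": "Classify the requirement as FR (functional) or NFR (non-functional). Reply with one label."},
        {"role": "user", "content": "The system shall log all failed login attempts."},
    ]

    result = slm(messages, max_new_tokens=10)
    print(result[0]["generated_text"])  # the returned conversation ends with the model's predicted label

No requirement text ever leaves the machine, which is precisely the privacy argument made above.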

The central question, however, has remained: Can these minnows truly compete with the titans? Do the advantages of privacy and resource efficiency come at an unacceptable cost to accuracy?

Our recent preliminary study, detailed in the paper "Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification" [1], set out to answer this question through a systematic, head-to-head comparison. The results are a game-changer for the future of automated RE.


Pitting Titans Against Minnows: The Study Design

To ensure a fair and rigorous comparison, we evaluated eight language models:

  • Three LLMs (Titans): GPT-5, xAI Grok-4, and Claude-4 (estimated at 1-2 trillion parameters). These were accessed via commercial APIs.
  • Five SLMs (Minnows): Qwen2-7B-Instruct, Falcon-7B-Instruct, Granite-3.2-8B-Instruct, Ministral-8B-Instruct-2410, and Meta-Llama-3-8B-Instruct (ranging from 7B to 8B parameters). These were executed locally on a high-performance Linux server.

The size difference is staggering: the LLMs are 100 to 300 times larger than their SLM counterparts.

All models were tested on three public, well-established requirements classification datasets:

  1. PROMISE [3]: Binary classification of Functional Requirements (FR) vs. Non-Functional Requirements (NFR).
  2. PROMISE Reclass [4]: A re-classification of PROMISE with two binary subtasks (FR vs. NFR, and Quality Requirements (QR) vs. non-QR).
  3. SecReq [10]: Binary classification of Security-related (Sec) vs. Non-Security (NSec) requirements.

To maximise performance and ensure a robust evaluation, we employed a sophisticated Chain-of-Thought (CoT) plus Few-Shot prompting strategy, which has been shown to be highly effective in prior work [16].
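To illustrate what such a prompt can look like, here is a minimal, hypothetical sketch of a CoT plus few-shot prompt for FR/NFR classification. The wording and the worked examples are ours for illustration; the exact prompts used in the study are described in [1].

    # Minimal sketch of a Chain-of-Thought + Few-Shot prompt for FR/NFR classification.
    FEW_SHOT_EXAMPLES = """\
    Requirement: "The system shall allow users to export reports as PDF."
    Reasoning: This describes a behaviour the system must perform, so it is functional.
    Label: FR

    Requirement: "The system shall respond to search queries within 2 seconds."
    Reasoning: This constrains how well the system performs rather than what it does, so it is non-functional.
    Label: NFR
    """

    def build_prompt(requirement: str) -> str:
        """Compose a CoT + few-shot classification prompt for a single requirement."""
        return (
            "You are a requirements engineering assistant. Classify each requirement as "
            "FR (functional) or NFR (non-functional). Think step by step, then give the label.\n\n"
            f"{FEW_SHOT_EXAMPLES}\n"
            f'Requirement: "{requirement}"\n'
            "Reasoning:"
        )

    print(build_prompt("All stored passwords must be hashed with a salted algorithm."))

The few-shot examples anchor the output format, while the "Reasoning:" cue elicits the chain of thought before the final label.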


The Shocking Results: A Statistical Tie

The descriptive statistics initially showed a slight edge for the LLMs, which achieved average F1 scores about 2% higher than the SLMs across all datasets. For instance, the top-performing LLM, Claude-4, consistently led with F1 scores of 0.81 (PROMISE), 0.80 (PROMISE Reclass), and 0.89 (SecReq). The best SLM, Llama-3-8B, was close behind, achieving scores of 0.76, 0.78, and 0.88 on the same tasks, respectively [2].

However, a deeper dive into the statistical analysis revealed the true story. Using the Scheirer-Ray-Hare test, a non-parametric equivalent of two-way ANOVA, we tested the null hypothesis that model type (SLM vs. LLM) has no effect on performance.

The finding was definitive: the performance difference between SLMs and LLMs is NOT statistically significant (p = 0.296) [1].

This means that, despite being up to 300 times smaller, the SLMs are functionally equivalent to the LLMs for the task of requirements classification.
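For readers who want to see what the test actually computes, the Scheirer-Ray-Hare test is essentially a two-way ANOVA carried out on ranks, with each effect's H statistic compared against a chi-square distribution. Below is a minimal Python sketch for a balanced design; the column names and data layout are illustrative and are not the study's actual analysis scripts.

    # Minimal sketch of the Scheirer-Ray-Hare test (rank-based two-way ANOVA), balanced design.
    import numpy as np
    import pandas as pd
    from scipy import stats

    def scheirer_ray_hare(df, dv, factor_a, factor_b):
        """Return H, degrees of freedom, and p-value for both main effects and the interaction."""
        data = df.copy()
        data["rank"] = stats.rankdata(data[dv])   # rank all observations together (ties get average ranks)
        N = len(data)
        grand_mean = data["rank"].mean()

        ss_total = ((data["rank"] - grand_mean) ** 2).sum()

        def ss_between(groups):
            # Between-group sum of squares on the ranks
            return sum(len(g) * (g["rank"].mean() - grand_mean) ** 2 for _, g in groups)

        ss_a = ss_between(data.groupby(factor_a))
        ss_b = ss_between(data.groupby(factor_b))
        ss_cells = ss_between(data.groupby([factor_a, factor_b]))
        ss_ab = ss_cells - ss_a - ss_b

        ms_total = ss_total / (N - 1)             # denominator; equals N(N+1)/12 when there are no ties

        rows = {}
        for name, ss, df_eff in [
            (factor_a, ss_a, data[factor_a].nunique() - 1),
            (factor_b, ss_b, data[factor_b].nunique() - 1),
            ("interaction", ss_ab, (data[factor_a].nunique() - 1) * (data[factor_b].nunique() - 1)),
        ]:
            h = ss / ms_total
            rows[name] = {"H": h, "df": df_eff, "p": stats.chi2.sf(h, df_eff)}
        return pd.DataFrame(rows).T

    # Hypothetical usage: one F1 score per (model, dataset) run
    # scores = pd.DataFrame({"f1": [...], "model_type": [...], "dataset": [...]})
    # print(scheirer_ray_hare(scores, dv="f1", factor_a="model_type", factor_b="dataset"))

A non-significant p-value for the model_type effect, as reported above, means the ranks of SLM and LLM scores are statistically indistinguishable.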


Nuanced Strengths: Where SLMs Excel

The competition was not just a tie; in specific, crucial metrics, the SLMs demonstrated specialised strengths:

  • Recall Advantage: In the PROMISE Reclass dataset, the SLMs Qwen2-7B and Falcon-7B achieved a remarkable Recall of 0.96, outperforming every LLM [2]. This indicates that for scenarios where minimising false negatives (missing a relevant requirement) is paramount, SLMs are exceptionally effective.
  • Precision Wins: The SLM Ministral-8B also outperformed the LLM Grok-4 in Precision on the PROMISE Reclass dataset, suggesting greater reliability in its positive predictions for that specific context [2].

These results suggest that SLMs are not just "good enough"; they can be the optimal choice when a specific performance trade-off (like high recall) is required.
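For readers less familiar with these metrics, the toy example below shows how a classifier can reach perfect recall (no missed quality requirements) while giving up some precision. The labels and predictions are invented purely for illustration.

    # Toy illustration of the precision/recall trade-off using scikit-learn.
    from sklearn.metrics import precision_score, recall_score, f1_score

    # 1 = quality requirement (QR), 0 = non-QR; hypothetical ground truth and predictions
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]  # catches every QR, but over-predicts the positive class

    print("Precision:", precision_score(y_true, y_pred))  # 5/7 ≈ 0.71
    print("Recall:   ", recall_score(y_true, y_pred))     # 5/5 = 1.00 -> no missed QRs
    print("F1:       ", f1_score(y_true, y_pred))         # ≈ 0.83

In safety- or security-critical triage, a reviewer can cheaply discard false positives, whereas a missed requirement may never be seen again, which is why a high-recall SLM can be the right tool.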


The Dominant Factor: The Dataset

Perhaps the most critical finding of the study was that dataset characteristics play a far more significant role in performance than model size.

The statistical analysis showed a highly significant main effect of the dataset on the F1 score (p < 0.001, with a large effect size η²_H = 0.63) [1]. All models, both SLMs and LLMs, performed worst on PROMISE Reclass and best on SecReq. Critically, the interaction between model type and dataset was not significant (p = 0.790) [1]. This confirms that all models are affected by dataset complexity in a similar manner, further supporting the conclusion that model size is a secondary factor.


Implications for Industry: A Viable, Private Alternative

The conclusion is clear: SLMs are a valid, high-performance alternative to LLMs for requirements classification.

For companies, this research provides the evidence needed to pivot away from expensive, privacy-compromising cloud-based LLMs. A marginal loss of roughly 2% in F1 score is an acceptable trade-off when weighed against the immense advantages of:

  • Data Privacy: Secure, local processing of confidential requirements.
  • Cost Efficiency: Eliminating expensive API calls and cloud infrastructure dependency.
  • Resource Management: Lower energy consumption and easier customisation.

This shift empowers organisations to maintain control over their most sensitive intellectual property while still leveraging the power of modern NLP for automated RE tasks.


The Road Ahead

While this study focused on binary classification, the future of SLMs in RE is bright. Our roadmap includes exploring:

  • Explainability [1]: Moving beyond simple labels to generate justifications for classifications, which is crucial for practitioner confidence and downstream tasks like traceability.
  • Hybrid Pipelines [1]: Identifying workflows where traditional ML, SLMs (for explanation), and LLMs (for conversational assistance) can be combined for maximum efficiency.
  • Energy and Speed [1]: Quantifying the energy footprint and execution speed differences, especially in local vs. cloud deployments, to provide a complete cost-benefit analysis.

The era of blindly chasing the largest model is over. For requirements classification, the evidence suggests that the smart, secure, and sustainable choice is often the small one.


References

[1] Zadenoori, M.A., et al.: Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification (2025), p. 1.
[2] Zadenoori, M.A., et al.: Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification (2025), p. 3.
[3] Cleland-Huang, J., et al.: Automated classification of non-functional requirements (2007).
[4] Dalpiaz, F., et al.: Requirements classification with interpretable machine learning and dependency parsing (2019).
[10] Knauss, E., et al.: Supporting requirements engineers in recognizing security issues (2011).
[16] Zadenoori, M.A., et al.: Automatic prompt engineering: The case of requirements classification (2025).
