Beyond Traditional Code Metrics: A Context-Aware Framework for Quality Evaluation of AI-Generated Code
Information
Authors: Ingrid Sardal, David Westin
Expected completion: 2026-06
Supervisor: Lukas Bergliden
Supervisor's company/institution: Decerno
Subject reviewer: Carl Nettelblad
Other: -
Presentations
Presentation by Ingrid Sardal
Presentation time: 2026-06-12 08:15
Presentation by David Westin
Presentation time: 2026-06-12 09:15
Opponents: Hanna Larsson, Julia Ploman
Abstract
The increasing use of generative AI in software development, particularly through large language models, has changed how code is produced and raised new questions regarding how the quality of such code should be assessed. Traditional algorithmic code quality metrics were developed for human-written code at a time when codebases were significantly smaller and simpler than today. Therefore, these metrics may not fully capture the characteristics of AI-generated code in modern software systems. This thesis investigates how AI-generated code can be evaluated in a reliable and meaningful way, focusing on algorithmic metrics as well as specification compliance and contextual interpretation. The study combines a literature review, interviews with developers, and an empirical analysis of AI-generated code. Based on these, an evaluation framework was developed that integrates deterministic algorithmic metrics with an AI-based reviewer, which interprets the metric results in relation to the generated code’s functional purpose and broader system context using relevant repository files.
The framework was validated by comparing its reviews with manual code reviews. The results show that developers are still better at identifying subtle semantic issues and design-related considerations, even though the framework catches some of these aspects. In contrast, the framework is more effective at detecting structural code properties. The findings highlight a growing need for automated evaluation tools as the scale of AI-generated code makes manual review increasingly impractical. The thesis concludes that reliable evaluation of AI-generated code requires a hybrid approach, where deterministic metrics provide a stable foundation and AI contributes contextual interpretation.
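The hybrid approach described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`deterministic_metrics`, `build_review_prompt`, `evaluate`), the choice of metrics, and the prompt layout are all assumptions made for illustration, and the AI-based reviewer is represented by an arbitrary callable rather than any specific model or API. The sketch only shows the general shape of the idea: reproducible metrics are computed first, then handed to a reviewer together with the code's stated purpose and relevant repository files.

```python
import ast
from typing import Callable, Dict

# Node types counted as decision points in this simplified estimate (an assumption,
# not the metric set used in the thesis).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def deterministic_metrics(source: str) -> Dict[str, int]:
    """Compute a few reproducible structural metrics for Python source code."""
    nodes = list(ast.walk(ast.parse(source)))
    return {
        "lines": len(source.splitlines()),
        "functions": sum(
            isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes
        ),
        # Rough cyclomatic-style estimate: one path plus one per decision point.
        "branch_complexity": 1 + sum(isinstance(n, _BRANCH_NODES) for n in nodes),
    }


def build_review_prompt(
    source: str, metrics: Dict[str, int], purpose: str, context_files: Dict[str, str]
) -> str:
    """Assemble the text handed to the reviewer (hypothetical layout)."""
    metric_lines = "\n".join(f"- {name}: {value}" for name, value in metrics.items())
    context = "\n\n".join(f"# {path}\n{text}" for path, text in context_files.items())
    return (
        f"Purpose of the generated code:\n{purpose}\n\n"
        f"Deterministic metrics:\n{metric_lines}\n\n"
        f"Relevant repository files:\n{context}\n\n"
        f"Generated code:\n{source}"
    )


def evaluate(
    source: str,
    purpose: str,
    context_files: Dict[str, str],
    reviewer: Callable[[str], str],
) -> Dict[str, object]:
    """Combine the stable metric layer with a contextual review."""
    metrics = deterministic_metrics(source)
    prompt = build_review_prompt(source, metrics, purpose, context_files)
    return {"metrics": metrics, "review": reviewer(prompt)}
```

Passing the reviewer in as a callable keeps the metric layer deterministic and testable on its own, while the interpretive step can be swapped or stubbed out.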