From Syntax Trees to Embeddings: A Comparative Study of AI-Generated Code Detection
A comparative study of AI-generated code detection using engineered syntax-tree features, CodeBERT embeddings, and graph-based syntax representations.
Large language models are increasingly used to generate program code, which complicates code authorship attribution in settings where originality is required, especially academic assessment.
This paper studies binary classification of source code as either AI-generated or human-written using the CoDeT-M4 dataset. It compares three families of approaches: traditional machine learning on engineered features extracted from concrete syntax trees, embedding-based deep models built on pretrained CodeBERT representations, and graph-based models operating on concrete syntax tree-derived graphs.
Motivation
LLMs make programming more accessible, easier, and faster, even for people with little or no programming experience. That creates a problem in settings where code must be written entirely by a human, or where plagiarism is strictly forbidden.
One such setting is academia, where students must be evaluated on their own skills. A student can paste the description of an exercise or task into one of these models as a prompt and get a working solution in one or a few iterations, without solving the task independently.
Data and Setup
The study uses CoDeT-M4, which contains 500,552 samples: 246,221 human and 254,331 AI. The samples cover Java, Python, and C++, and are split into train, validation, and test subsets.
During data exploration, minor leakage was identified: 1,234 samples appeared in more than one subset. Before the experiments, each overlapping sample was kept in only one subset, with priority given to train, then test, so the evaluation used a cleaned split.
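The priority rule described above can be sketched as follows. This is a minimal illustration, not the study's actual cleaning script: it assumes that an "overlap" means an exact match on the code text, and the names `fingerprint` and `remove_overlaps` are hypothetical.

```python
import hashlib


def fingerprint(code: str) -> str:
    """Hash a snippet so that identical code text maps to the same key."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def remove_overlaps(splits: dict, priority: list) -> dict:
    """Keep each snippet only in the highest-priority split that contains it.

    Assumption: overlap is defined as exact equality of the code text.
    Duplicates inside a single split are left untouched.
    """
    seen_earlier = set()  # fingerprints already claimed by a higher-priority split
    cleaned = {}
    for name in priority:
        kept = [code for code in splits[name] if fingerprint(code) not in seen_earlier]
        cleaned[name] = kept
        seen_earlier.update(fingerprint(code) for code in kept)
    return cleaned


# Toy example: "a" leaks into test, "b" and "c" leak into validation.
splits = {
    "train": ["a", "b"],
    "validation": ["b", "c", "d"],
    "test": ["a", "c"],
}
cleaned = remove_overlaps(splits, priority=["train", "test", "validation"])
```

Processing the splits in priority order means a shared sample always stays in train if it occurs there, otherwise in test, and is dropped from the remaining subsets.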
Methods
Concrete syntax trees were built with Tree-Sitter. CSTs were used instead of ASTs because this task targets authorship cues, where low-level syntactic decisions, formatting-related structure, and comment placement may carry signal that AST abstractions often discard.
The traditional machine-learning baselines used engineered CST features such as functions, classes, if statements, loops, imports, comments, binary operations, and parse errors. The embedding-based models used pretrained CodeBERT representations, including a Multi-Scale CNN and a Multi-Scale CNN + Bi-LSTM adapted for code. The graph-based models converted CSTs into PyTorch-Geometric graphs and evaluated GraphSAGE, GAT, and Graph Transformer variants.
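As a rough illustration of the engineered-feature idea, the sketch below counts node types in a syntax tree and groups them into the feature names listed above. It assumes only that each node exposes `type` and `children`, which Tree-sitter nodes do; the `Node` class is a stand-in so the example runs without a parser, and the `FEATURE_TYPES` mapping (using Python-grammar style node names) is a hypothetical choice, not the feature definition used in the paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a CST node with the two attributes the walk relies on."""
    type: str
    children: list = field(default_factory=list)


# Hypothetical mapping from feature name to CST node types.
FEATURE_TYPES = {
    "functions": {"function_definition"},
    "classes": {"class_definition"},
    "if_statements": {"if_statement"},
    "loops": {"for_statement", "while_statement"},
    "imports": {"import_statement", "import_from_statement"},
    "comments": {"comment"},
    "binary_operations": {"binary_operator"},
    "parse_errors": {"ERROR"},
}


def count_features(root) -> dict:
    """Walk the tree iteratively and return one count per engineered feature."""
    type_counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        type_counts[node.type] += 1
        stack.extend(node.children)
    return {
        name: sum(type_counts[t] for t in types)
        for name, types in FEATURE_TYPES.items()
    }


tree = Node("module", [
    Node("comment"),
    Node("function_definition", [
        Node("if_statement", [Node("binary_operator")]),
        Node("for_statement"),
    ]),
    Node("ERROR"),
])
features = count_features(tree)
```

The resulting dictionary can be flattened into a fixed-length vector and passed to a tabular model such as Random Forest or CatBoost. Node type names differ between the Java, Python, and C++ grammars, so a real mapping would need one entry set per language.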
Results
Embedding-based methods achieved the best overall performance. The Multi-Scale CNN + Bi-LSTM reached 0.9836 accuracy and 0.9836 F1, suggesting that CodeBERT embeddings benefit from local convolutional patterns, with a smaller additional gain from Bi-LSTM sequence modeling.
Graph-based methods remained competitive. The best graph-convolutional setup used GraphSAGE with comment indicator nodes and mean pooling, reaching 0.9453 accuracy and 0.9432 F1. This suggests that comment presence and placement, even without the comment text, carry stylistic signal relevant to authorship.
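To make the graph representation more concrete, here is a small sketch of how a syntax tree might be turned into a graph with a comment indicator and then mean-pooled. It is one possible reading of "comment indicator nodes", not the paper's pipeline: comment nodes keep their position in the tree but carry only a 0/1 flag and no text. The `Node` class, `cst_to_graph`, and `mean_pool` are hypothetical names; the edge list uses the two-row source/target layout that PyTorch Geometric expects for `edge_index`, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a CST node exposing `type` and `children`."""
    type: str
    children: list = field(default_factory=list)


def cst_to_graph(root):
    """Return (node_types, is_comment, edge_index) for a tree.

    edge_index is [sources, targets]; each parent-child link is stored
    in both directions so messages can flow up and down the tree.
    """
    node_types, is_comment = [], []
    sources, targets = [], []
    stack = [(root, -1)]  # (node, index of parent or -1 for the root)
    while stack:
        node, parent = stack.pop()
        idx = len(node_types)
        node_types.append(node.type)
        is_comment.append(1.0 if node.type == "comment" else 0.0)
        if parent >= 0:
            sources += [parent, idx]
            targets += [idx, parent]
        # Push children in reverse so they are numbered left to right.
        for child in reversed(node.children):
            stack.append((child, idx))
    return node_types, is_comment, [sources, targets]


def mean_pool(node_vectors):
    """Average per-node feature vectors into one graph-level vector."""
    n = len(node_vectors)
    dim = len(node_vectors[0])
    return [sum(vec[d] for vec in node_vectors) / n for d in range(dim)]


tree = Node("module", [
    Node("comment"),
    Node("function_definition", [Node("if_statement")]),
])
node_types, is_comment, edge_index = cst_to_graph(tree)
graph_vector = mean_pool([[flag] for flag in is_comment])
```

In a trained model, mean pooling would be applied to the node embeddings produced by the GraphSAGE layers rather than to raw flags; the toy call above only shows the shape of the operation.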
Traditional machine-learning models lagged behind both deep-learning families. Random Forest and CatBoost were the strongest traditional baselines, but the results suggest that handcrafted CST features alone are not enough for peak predictive performance.
Error Analysis
False negatives were inspected for the strongest embedding-based run on the held-out test split. Out of 47,046 samples, 270 were false negatives. Most errors were concentrated in low-information snippets, often short utility wrappers or one-expression routines where AI and human style overlap.
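The kind of inspection described above can be sketched as follows. The function name and the toy data are hypothetical, and the sketch assumes that the positive class (label 1) is AI-generated code, which the text does not state explicitly. Sorting the misclassified snippets by length is one simple way to surface the short, low-information cases first.

```python
def false_negatives(codes, y_true, y_pred, positive=1):
    """Return misclassified positive samples, shortest snippet first.

    Assumption: `positive` marks the AI-generated class.
    """
    missed = [
        code
        for code, truth, pred in zip(codes, y_true, y_pred)
        if truth == positive and pred != positive
    ]
    return sorted(missed, key=len)


# Toy data, not taken from the dataset.
codes = ["def add(a, b):\n    return a + b", "return x", "class A:\n    pass", "x = 1"]
y_true = [1, 1, 0, 1]
y_pred = [0, 0, 0, 1]
missed = false_negatives(codes, y_true, y_pred)

# Share of the test split that the reported false negatives represent.
fn_share = 270 / 47046  # about 0.0057, i.e. roughly 0.57% of test samples
```

With the reported counts, false negatives account for roughly 0.57% of the held-out test samples.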
Takeaway
Embedding-based approaches maximize accuracy, while CST-derived graph models remain structurally grounded and more directly auditable at the syntax level. The main trade-off is between peak predictive performance and representations whose signal can be traced back to syntax structure.