Abstract
Automatic question generation (AQG) systems that apply AI to create formative practice items at scale have proven effective for providing students with a learning-by-doing approach in e-textbook environments, overcoming the barriers to scaling this learning-science-based method. This paper investigates whether a large language model (LLM) can improve the selection of answer terms in fill-in-the-blank questions (currently chosen by a rule-based system) by attending to both domain relevance and sentence-level nuance. Drawing on a dataset of more than 1.3 million student-question sessions, an explanatory logistic regression model tested the causal hypothesis that, conditional on pre-treatment features, questions for which the LLM and the rule-based system agree on the answer blank will receive more favorable ratings from students. Results reveal that agreement corresponds to a 31% decrease in the likelihood of a thumbs-down rating, controlling for previously identified causal factors. Rather than replacing the rule-based system, LLM-rule agreement serves as a signal for questions students perceive as higher quality. These findings offer initial evidence that incorporating an LLM-based agreement filter into an established AQG pipeline can enhance question quality while preserving factual accuracy.
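
As a minimal sketch of the kind of specification described above (the covariate set and variable names here are illustrative assumptions, not taken from the paper), the analysis can be read as a standard logistic regression with an agreement indicator:

\[ \log\frac{\Pr(\text{ThumbsDown}_i = 1)}{1 - \Pr(\text{ThumbsDown}_i = 1)} = \beta_0 + \beta_1\,\mathrm{Agree}_i + \boldsymbol{\gamma}^{\top}\mathbf{x}_i \]

where \(\mathrm{Agree}_i\) indicates LLM-rule agreement on the answer blank and \(\mathbf{x}_i\) collects the pre-treatment controls. If the reported 31% decrease is expressed on the odds scale, it would correspond to an odds ratio of roughly \(e^{\beta_1} \approx 0.69\) for agreeing questions.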