Abstract
Formative practice embedded in textbooks has been shown to substantially enhance learning, yet manually authoring high-quality questions at scale is prohibitively costly. Recent advances in automatic question generation (AQG) have enabled large-scale production of formative practice questions. This study investigates whether a large language model (LLM) can improve the selection of textbook sentences for fill-in-the-blank cloze questions by screening out sentences whose questions draw higher rates of negative student feedback. A larger LLM was employed to expedite prompt engineering for the smaller model that performs sentence classification. Over 1.3 million student-question sessions spanning 2,500 textbooks were analyzed with an explanatory logistic regression model. Questions derived from sentences rejected by the filter were over three times as likely to receive a thumbs-down rating from students, even after controlling for previously established causal factors. This finding indicates that an LLM-based filter can be integrated into existing rule-based AQG pipelines to remove flawed items and raise overall question quality in large-scale educational applications with minimal additional overhead.
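To make the filtering step concrete, the sketch below shows one way an LLM-based sentence classifier could gate candidates before a rule-based AQG stage. It is a minimal illustration, not the paper's implementation: the prompt wording, model name, and function names are assumptions, and only the overall accept/reject gating reflects the approach described in the abstract.

```python
# Minimal sketch: gate rule-based cloze candidates with an LLM filter.
# The prompt text, model name, and helper names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

FILTER_PROMPT = (
    "Decide whether the following textbook sentence would make a clear, "
    "self-contained fill-in-the-blank question. Answer ACCEPT or REJECT.\n\n"
    "Sentence: {sentence}"
)

def sentence_is_suitable(sentence: str) -> bool:
    """Return True if the LLM filter accepts the sentence for cloze generation."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed smaller model; the paper does not name one here
        messages=[{"role": "user", "content": FILTER_PROMPT.format(sentence=sentence)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("ACCEPT")

def select_sentences_for_aqg(sentences: list[str]) -> list[str]:
    """Keep only filter-accepted sentences; the existing rule-based AQG step then generates items."""
    return [s for s in sentences if sentence_is_suitable(s)]
```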