Vina Najahah (1), Utomo Pujianto (2)
General Background: Determining question difficulty is a fundamental requirement in educational assessment to support valid evaluation and systematic question curation. Specific Background: The increasing use of artificial intelligence for automatic question generation produces large volumes of linguistically diverse items, making manual difficulty labeling time-consuming and subjective. Knowledge Gap: Despite extensive research on text-based difficulty prediction, lightweight and reproducible pipelines for multi-level difficulty classification of AI-generated questions remain limited. Aims: This study aims to develop and evaluate an automatic classification pipeline for three difficulty levels of AI-generated multiple-choice questions using TF-IDF text representation and a Random Forest classifier. Results: The proposed pipeline achieved a test accuracy of 70.98%, exceeding the random guessing baseline, with the highest F1-score observed in the easy class (78.45%) and the lowest in the medium class (65.32%), indicating greater ambiguity in intermediate difficulty questions. Novelty: This study presents a reproducible and interpretable classification workflow specifically applied to expert-labeled AI-generated questions with high inter-rater reliability. Implications: The findings support the use of lexical feature–based classification as an initial pre-curation and difficulty filtering tool in AI-assisted educational assessment systems.
• The classification pipeline distinguishes three difficulty levels using only textual features
• Medium difficulty questions exhibit the highest classification ambiguity
• Lexical patterns contribute consistently to difficulty level separation
Question Difficulty Classification; AI-Generated Questions; TF-IDF Representation; Random Forest Classifier; Educational Assessment
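As a rough illustration of the workflow the abstract describes (TF-IDF lexical features feeding a Random Forest over three difficulty classes), the sketch below uses scikit-learn's TfidfVectorizer and RandomForestClassifier. The questions, labels, and all hyperparameters shown are invented placeholders for illustration, not the study's dataset or configuration.

```python
# Minimal sketch of a TF-IDF + Random Forest difficulty classifier.
# The toy questions/labels are illustrative only, not the study's data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

questions = [
    "What is the capital of France?",                      # easy
    "Define the derivative of a function.",                # medium
    "Prove that the square root of 2 is irrational.",      # hard
    "Name the largest planet in the solar system.",        # easy
    "Explain how TF-IDF weights rare terms.",              # medium
    "Derive a closed form for the Fibonacci sequence.",    # hard
]
labels = ["easy", "medium", "hard", "easy", "medium", "hard"]

# TF-IDF maps each question to a sparse lexical feature vector;
# the Random Forest then votes over decision trees built on those features.
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipeline.fit(questions, labels)

# An unseen question is assigned one of the three difficulty labels.
pred = pipeline.predict(["What is the capital of Japan?"])[0]
print(pred)
```

In practice the labeled set would be split into train/test partitions and scored with per-class F1 (e.g. `sklearn.metrics.f1_score` with `average=None`) to surface the easy/medium/hard asymmetry the abstract reports.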