Raissa Araminta Khairunnisa (1), Utomo Pujianto (2)
General Background: The integration of Generative AI into educational assessment enables the rapid construction of large-scale question banks, particularly in programming education, yet raises concerns about content validity. Specific Background: In the algorithms and programming domain, Generative AI models frequently assign Higher Order Thinking Skills (HOTS) and Lower Order Thinking Skills (LOTS) labels automatically, creating potential discrepancies with Bloom’s Taxonomy classifications. Knowledge Gap: Empirical evidence validating the reliability of AI-generated cognitive labels, and comparing statistical and transformer-based classification methods on small, domain-specific Indonesian datasets, remains limited. Aims: This study audits the reliability of cognitive labels generated by the Gemini model through expert validation and compares TF-IDF–SVM and IndoBERT–SVM classifiers under class-imbalanced conditions. Results: Expert validation revealed substantial mislabeling: a dataset claimed to be balanced proved, after correction, to be skewed toward LOTS. Classification experiments using five-fold cross-validation showed that TF-IDF–SVM achieved a slightly higher macro F1-score than IndoBERT–SVM. Novelty: The study demonstrates that simple lexical representations with stemming can outperform transformer-based embeddings when data are limited and domain-specific. Implications: These findings underscore the necessity of human validation of AI-generated assessments and support the use of lightweight statistical text classification for automated cognitive-level evaluation in constrained educational contexts.
• Generative AI cognitive labels showed substantial inconsistency after expert validation
• Lexical feature representation yielded higher macro-level classification balance
• Human-in-the-loop validation remained essential for programming assessment datasets
HOTS; LOTS; Generative AI; Text Classification; TF-IDF
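To make the comparison in the abstract concrete, the following is a minimal sketch, not the authors' code, of the TF-IDF–SVM baseline: TF-IDF lexical features feed a linear SVM, evaluated with stratified five-fold cross-validation and the macro F1-score. The ten example questions, their labels, and all parameter choices are hypothetical placeholders; in the actual pipeline, Indonesian stemming (e.g., with a Sastrawi-style stemmer) would be applied before vectorization.

```python
# Minimal sketch of a TF-IDF + linear-SVM classifier for HOTS/LOTS
# question labeling, evaluated with stratified 5-fold CV and macro F1.
# The ten questions below are hypothetical placeholders, not study data.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

texts = [
    "Sebutkan tipe data dasar dalam bahasa C",             # recall    -> LOTS
    "Jelaskan perbedaan perulangan while dan do-while",    # understand-> LOTS
    "Tuliskan sintaks deklarasi array satu dimensi",       # recall    -> LOTS
    "Apa fungsi operator modulus dalam pemrograman",       # recall    -> LOTS
    "Definisikan istilah variabel global",                 # recall    -> LOTS
    "Rancang algoritma untuk mengurutkan data transaksi",  # create    -> HOTS
    "Evaluasi efisiensi dua solusi pencarian berikut",     # evaluate  -> HOTS
    "Analisis penyebab kesalahan logika pada kode ini",    # analyze   -> HOTS
    "Bandingkan kompleksitas bubble sort dan merge sort",  # analyze   -> HOTS
    "Usulkan perbaikan struktur data untuk kasus berikut", # create    -> HOTS
]
labels = ["LOTS"] * 5 + ["HOTS"] * 5

# TF-IDF features feeding a linear SVM; class_weight="balanced" is one
# common way to soften the LOTS skew noted in the abstract.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(class_weight="balanced")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
print("macro F1 per fold:", scores)
print("mean macro F1: %.3f" % scores.mean())
```

The IndoBERT–SVM variant would replace the TfidfVectorizer step with sentence embeddings from an IndoBERT encoder; the downstream SVM, cross-validation protocol, and macro F1 scoring stay the same.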