Formative assessment is a cornerstone of effective teaching and learning, providing students with feedback to guide their learning. While there has been an exponential growth in the application of generative AI in scaling various aspects of formative assessment, ranging from automatic question generation to intelligent tutoring systems and personalized feedback, few have directly addressed the core pedagogical principles of formative assessment. Here, we critically examined how generative AI, especially large-language models (LLMs) such as ChatGPT, can support key components of formative assessment: helping students, teachers, and peers understand "where learners are going," "where learners currently are," and "how to move learners forward" in the learning process. With the rapid emergence of new prompting techniques and LLM capabilities, we also provide guiding principles for educators to effectively leverage cost-free LLMs in formative assessments while remaining grounded in pedagogical best practices. Furthermore, we reviewed the role of LLMs in generating feedback, highlighting limitations in current evaluation metrics that inadequately capture the nuances of formative feedback
This paper presents VITA (Virtual Teaching Assistants), an adaptive distributed learning (ADL) platform that embeds a large language model (LLM)-powered chatbot (BotCaptain) to provide dialogic support, interoperable analytics, and integrity-aware assessment for workforce preparation in data science. The platform couples context-aware conversational tutoring with formative-assessment patterns designed to promote reflective reasoning. The paper describes an end-to-end data pipeline that transforms chat logs into Experience API (xAPI) statements, instructor dashboards that surface outliers for just-in-time intervention, and an adaptive pathway engine that routes learners among progression, reinforcement, and remediation content. The paper also benchmarks VITA conceptually against emerging tutoring architectures, including retrieval-augmented generation (RAG)--based assistants and Learning Tools Interoperability (LTI)--integrated hubs, highlighting trade-offs among content grounding, interoperability, and deployment complexity. Contributions include a reusable architecture for interoperable conversational analytics, a catalog of patterns for integrity-preserving formative assessment,
Model misspecification of formative indicators remains a widely documented issue across academic literature, yet scholars lack a clear consensus on pragmatic, prescriptive approaches to manage this gap. This ambiguity forces researchers to rely on psychometric frameworks primarily intended for reflective models, and thus risks misleading findings. This article introduces a Multi-Step Validation Methodology Framework specifically designed for formative constructs in survey-based research. The proposed framework is grounded in an exhaustive literature review and integrates essential pilot diagnostics through descriptive statistics and multicollinearity checks. The methodology provides researchers with the necessary theoretical and structural clarity to finally justify and adhere to appropriate validation techniques that accurately account for the causal nature of the constructs while ensuring high psychometric and statistical integrity.
This paper explores the development and adoption of AI-based formative feedback in the context of biweekly reports in an engineering Capstone program. Each student is required to write a short report detailing their individual accomplishments over the past two weeks, which is then assessed by their advising professor. An LLM-powered tool was developed to provide students with personalized feedback on their draft reports, guiding them toward improved completeness and quality. Usage data across two rounds revealed an initial barrier to adoption, with low engagement rates. However, students who engaged in the AI feedback system demonstrated the ability to use it effectively, leading to improvements in the completeness and quality of their reports. Furthermore, the tool's task-parsing capabilities provided a novel approach to identify potential student organizational tasks and deliverables. The findings suggest initial skepticism toward the tool with a limited adoption within the studied context, however, they also highlight the potential for AI-driven tools to provide students and professors valuable insights and formative support.
Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains--such as science, computing, and engineering--remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance gradin
The ability to revise essays in response to feedback is important for students' writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.
Most currently accepted approaches to evaluating Research through Design (RtD) presume that design prototypes are finalized and ready for robust testing in laboratory or in-the-wild settings. However, it is also valuable to assess designs at intermediate phases with mid-fidelity prototypes, not just to inform an ongoing design process, but also to glean knowledge of broader use to the research community. We propose 'formative situations' as a frame for examining mid-fidelity prototypes-in-process in this way. We articulate a set of criteria to help the community better assess the rigor of formative situations, in the service of opening conversation about establishing formative situations as a valuable contribution type within the RtD community.
Formative assessment in STEM topics aims to promote student learning by identifying students' current understanding, thus targeting how to promote further learning. Previous studies suggest that the assessment performance of current generative large language models (LLMs) on constructed responses to open-ended questions is significantly lower than that of supervised classifiers trained on high-quality labeled data. However, we demonstrate that concept-based rubrics can significantly enhance LLM performance, which narrows the gap between LLMs as off-the shelf assessment tools, and smaller supervised models, which need large amounts of training data. For datasets where concept-based rubrics allow LLMs to achieve strong performance, we show that the concept-based rubrics help the same LLMs generate high quality synthetic data for training lightweight, high-performance supervised models. Our experiments span diverse STEM student response datasets with labels of varying quality, including a new real-world dataset that contains some AI-assisted responses, which introduces additional considerations.
Medical education faces challenges in providing scalable, consistent clinical skills training. Simulation with standardized patients (SPs) develops communication and diagnostic skills but remains resource-intensive and variable in feedback quality. Existing AI-based tools show promise yet often lack comprehensive assessment frameworks, evidence of clinical impact, and integration of self-regulated learning (SRL) principles. Through a multi-phase co-design process with medical education experts, we developed MedSimAI, an AI-powered simulation platform that enables deliberate practice through interactive patient encounters with immediate, structured feedback. Leveraging large language models, MedSimAI generates realistic clinical interactions and provides automated assessments aligned with validated evaluation frameworks. In a multi-institutional deployment (410 students; 1,024 encounters across three medical schools), 59.5 percent engaged in repeated practice. At one site, mean Objective Structured Clinical Examination (OSCE) history-taking scores rose from 82.8 to 88.8 (p < 0.001, Cohen's d = 0.75), while a second site's pilot showed no significant change. Automated scoring achi
The workforce will need to continually upskill in order to meet the evolving demands of industry, especially working with robotic and autonomous systems. Current training methods are not scalable and do not adapt to the skills that learners already possess. In this work, we develop a system that automatically assesses learner skill in a quadrotor teleoperation task using temporal logic task specifications. This assessment is used to generate multimodal feedback based on the principles of effective formative feedback. Participants perceived the feedback positively. Those receiving formative feedback viewed the feedback as more actionable compared to receiving summary statistics. Participants in the multimodal feedback condition were more likely to achieve a safe landing and increased their safe landings more over the experiment compared to other feedback conditions. Finally, we identify themes to improve adaptive feedback and discuss and how training for complex psychomotor tasks can be integrated with learning theories.
This formative study investigates the impact of data quality on AI-assisted data visualizations, focusing on how uncleaned datasets influence the outcomes of these tools. By generating visualizations from datasets with inherent quality issues, the research aims to identify and categorize the specific visualization problems that arise. The study further explores potential methods and tools to address these visualization challenges efficiently and effectively. Although tool development has not yet been undertaken, the findings emphasize enhancing AI visualization tools to handle flawed data better. This research underscores the critical need for more robust, user-friendly solutions that facilitate quicker and easier correction of data and visualization errors, thereby improving the overall reliability and usability of AI-assisted data visualization processes.
In team sports, effective tactical communication is crucial for success, particularly in the fast-paced and complex environment of outdoor athletics. This paper investigates the challenges faced in transmitting strategic plans to players and explores potential solutions using eXtended Reality (XR) technologies. We conducted a formative study involving interviews with 4 Division I professional soccer coaches, 4 professional players, 2 college club coaches, and 2 college club players, as well as a survey among 17 Division I players. The study identified key requirements for tactical communication tools, including the need for rapid communication, minimal disruption to game flow, reduced cognitive load, clear visualization for all players, and enhanced auditory clarity. Based on these insights, we propose a potential solution - a Mobile Augmented Reality (AR) system designed to address these challenges by providing real-time, intuitive tactical visualization and communication. The system aims to improve strategic planning and execution, thereby enhancing team performance and cohesion. This work represents a significant step towards integrating XR technologies into sports coaching, off
Sex education helps children obtain knowledge and awareness of sexuality, and protects them against sexually transmitted diseases, pregnancy, and sexual abuse. Sex education is not well taught to children in China -- both school-based education and parental communication on this topic are limited. To interrogate the status quo of sex education in China and explore suitable interventions, we conducted a series of formative studies including interviews and social media analysis. Multiple stakeholders such as children, parents, education practitioners, and the general public were engaged for an in-depth understanding of their unique needs regarding teaching and learning sex education. We found that school-based sex education for Chinese children was currently insufficient and restrictive. Involving parents in sex education posed several challenges, such as a lack of sexuality and pedagogy knowledge, and embarrassment in initiating sex education conversations. Culture and politics were major hurdles to effective sex education. Based on the findings, we reflect on the complex interactions between culture, politics, education policy, and pedagogy, and discuss situated design of sex educa
This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning. Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments.
Automated methods are becoming increasingly integrated into studies of formative feedback on students' science explanation writing. Most of this work, however, addresses students' responses to short answer questions. We investigate automated feedback on students' science explanation essays, where students must articulate multiple ideas. Feedback is based on a rubric that identifies the main ideas students are prompted to include in explanatory essays about the physics of energy and mass, given their experiments with a simulated roller coaster. We have found that students generally improve on revised versions of their essays. Here, however, we focus on two factors that affect the accuracy of the automated feedback. First, we find that the main ideas in the rubric differ with respect to how much freedom they afford in explanations of the idea, thus explanation of a natural law is relatively constrained. Students have more freedom in how they explain complex relations they observe in their roller coasters, such as transfer of different forms of energy. Second, by tracing the automated decision process, we can diagnose when a student's statement lacks sufficient clarity for the automat
Delivering personalised, formative feedback to multiple problem-based learning groups in a short time period can be almost impossible. We employed ChatGPT to provide personalised formative feedback in a one-hour Zoom break-out room activity that taught practicing health professionals how to formulate evaluation plans for digital health initiatives. Learners completed an evaluation survey that included Likert scales and open-ended questions that were analysed. Half of the 44 survey respondents had never used ChatGPT before. Overall, respondents found the feedback favourable, described a wide range of group dynamics, and had adaptive responses to the feedback, yet only three groups used the feedback loop to improve their evaluation plans. Future educators can learn from our experience including engineering prompts, providing instructions on how to use ChatGPT, and scaffolding optimal group interactions with ChatGPT. Future researchers should explore the influence of ChatGPT on group dynamics and derive design principles for the use of ChatGPT in collaborative learning.
Effectively supporting students in mastering all facets of self-regulated learning is a central aim of teachers and educational researchers. Prior research could demonstrate that formative feedback is an effective way to support students during self-regulated learning (SRL). However, for formative feedback to be effective, it needs to be tailored to the learners, requiring information about their learning progress. In this work, we introduce LEAP, a novel platform that utilizes advanced large language models (LLMs), such as ChatGPT, to provide formative feedback to students. LEAP empowers teachers with the ability to effectively pre-prompt and assign tasks to the LLM, thereby stimulating students' cognitive and metacognitive processes and promoting self-regulated learning. We demonstrate that a systematic prompt design based on theoretical principles can provide a wide range of types of scaffolds to students, including sense-making, elaboration, self-explanation, partial task-solution scaffolds, as well as metacognitive and motivational scaffolds. In this way, we emphasize the critical importance of synchronizing educational technological advances with empirical research and theore
The contribution provides a detailed exploration of the online platform Perusall as an advanced social annotation technology in teaching and learning STEM disciplines. This exploration is based on the authors' insights and experiences from three years of implementing Perusall at P.J. Šafárik University in Košice, Slovakia. While the concept of social annotation technology and its educational applications are not novel, Perusall's advanced features, including AI and data science reports, enable its use in both synchronous and asynchronous blended and flipped learning environments. In this context, Perusall serves as a digital tool for collecting formative data, monitoring student progress, and identifying areas of difficulty. This assessment data can be effectively utilized in preparing and personalizing subsequent face-to-face group interactions, thereby enhancing and improving the learning experience. From a pedagogical viewpoint, Perusall's role was particularly significant during the Covid-19 pandemic, enabling effective, continuous, and engaging learning amidst social distancing and physical restrictions. Today, Perusall has become a key tool in blended learning, facilitating h
Technology development practices in industry are often primarily focused on business results, which risks creating unbalanced power relations between corporate interests and the needs or concerns of people who are affected by technology implementation and use. These practices, and their associated cultural norms, may result in uses of technology that have direct, indirect, short-term, and even long-term negative effects on groups of people and/or the environment. This paper contributes a formative framework -- the Responsible and Inclusive Technology Framework -- that orients critical reflection around the social contexts of technology creation and use; the power dynamics between self, business, and societal stakeholders; the impacts of technology on various communities across past, present, and future dimensions; and the practical decisions that imbue technological artifacts with cultural values. We expect that the implementation of the Responsible and Inclusive Technology framework, especially in business-to-business industry settings, will serve as a catalyst for more intentional and socially-grounded practices, thus bridging the responsibility and principles-to-practice gap.
The dissemination of scholarly research is critical, yet researchers often lack the time and skills to create engaging content for popular media such as short-form videos. To address this gap, we explore the use of generative AI to help researchers transform their academic papers into accessible video content. Informed by a formative study with science communicators and content creators (N=8), we designed PaperTok, an end-to-end system that automates the initial creative labor by generating script options and corresponding audiovisual content from a source paper. Researchers can then refine based on their preferences with further prompting. A mixed-methods user study (N=18) and crowdsourced evaluation (N=100) demonstrate that PaperTok's workflow can help researchers create engaging and informative short-form videos. We also identified the need for more fine-grained controls in the creation process. To this end, we offer implications for future generative tools that support science outreach.