20 results found
The effectiveness of AI debugging follows a predictable exponential decay pattern; most models lose 60-80% of their debugging capability within just 2-3 attempts, despite iterative debugging being a critical capability for practical code generation systems. We introduce the Debugging Decay Index (DDI), a mathematical framework that quantifies when debugging becomes ineffective and predicts intervention points. Our strategic fresh start approach shifts from exploitation to exploration at strategic points in the debugging process, demonstrating that well-timed interventions can rescue the effectiveness of debugging. DDI reveals a fundamental limitation in current AI self-debugging and provides the first systematic metric to gauge LLM-based code generation.
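The exponential decay pattern described above can be illustrated with a small fit. This is a minimal sketch and not the authors' DDI formula, which the abstract does not give. It assumes effectiveness follows E(t) = E0 * exp(-lambda * t); the per-attempt success rates and the helper names `decay_rate` and `attempts_until_loss` are hypothetical.

```python
import math

def decay_rate(effectiveness):
    """Fit log(E_t) = log(E_0) - lam * t by least squares and return lam.

    `effectiveness` holds one positive success rate per debugging attempt,
    starting at attempt 0.
    """
    ts = list(range(len(effectiveness)))
    ys = [math.log(e) for e in effectiveness]
    t_mean = sum(ts) / len(ts)
    y_mean = sum(ys) / len(ys)
    cov = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    var = sum((t - t_mean) ** 2 for t in ts)
    return -cov / var

def attempts_until_loss(lam, loss_fraction):
    """Smallest whole number of attempts after which the fitted curve
    has lost at least `loss_fraction` of its initial effectiveness."""
    return math.ceil(-math.log(1.0 - loss_fraction) / lam)

# Hypothetical data: the success rate halves on every attempt.
rates = [0.40, 0.20, 0.10, 0.05]
lam = decay_rate(rates)
intervention_point = attempts_until_loss(lam, 0.70)
```

With a curve like this, a fresh start could be scheduled at `intervention_point`, the attempt at which the fitted loss crosses the chosen threshold.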
Creating publication-quality visualizations is essential for bioinformatics but remains a bottleneck for researchers with limited coding expertise. While Large Language Models (LLMs) are proficient at generating code, they often fail in practice due to library dependencies, dataset mismatches, or syntax errors. These issues require manual intervention, slowing data interpretation. We present ggplotAgent, a novel multi-modal, self-debugging artificial intelligence agent that automates publication-ready ggplot2 visualizations. It features a dual-layered framework that resolves code execution errors and uses a vision-enabled agent to verify aesthetic correctness. In benchmarks against the DeepSeek-V3 model, ggplotAgent achieved a 100% code executability rate (versus 85%) and a "Publication-Ready" score of 1.9 (versus 0.7). Surprisingly, it showcased the ability to act as an expert collaborator by intelligently enhancing plots beyond the user's literal prompt, achieving a positive Insight Score of +0.3 compared with the baseline (-0.05). These results demonstrate its ability to reliably produce accurate, high-quality visualizations directly from natural language. ggplotAgent is freely accessible as a public web application at https://ggplotagent.databio1.com/ and an offline Streamlit app. The source code is available on GitHub at https://github.com/charlin90/ggplotAgent. This software is distributed under the MIT License.
Producing second-generation ethanol from lignocellulosic hydrolysates (LCHs) poses significant challenges for Saccharomyces cerevisiae due to the presence of fermentation inhibitors. Quantitative trait loci (QTL) mapping of stress-tolerant S. cerevisiae strains is important for identifying adaptive alleles that can enhance yeast fermentation of LCHs. However, the QTL mapping process is labor-intensive, requiring the screening of numerous recombinants and repeated crossings to improve mapping resolution. We developed Reiterated Mass Selection and backcrosSing (ReMaSSing) to facilitate the identification of adaptive alleles through QTL mapping and to enhance LCH tolerance in yeast strains. ReMaSSing was applied to populations obtained by crossing the stress-resistant yeast PE-2_H4 with the laboratory strain S288C. Using alternative protocols, we selected haploid or diploid populations with dominant markers, enriching millions of segregants carrying adaptive alleles by propagating them in standard or LCH-supplemented media. The enriched pools were then bulk backcrossed with S288C, and germination of millions of spores generated new recombinant populations for subsequent selection cycles. After five rounds of ReMaSSing, whole-genome sequencing and QTL mapping identified key alleles associated with LCH tolerance, linked to VPS70, CAT5, GCY1, UBP2, MKT1/SAL1, HAP1, and PHO84, which influence growth and mitochondrial function in S288C. Mutations in IRA1 and HTA1, unique to our S288C strain, were also mapped, highlighting ReMaSSing's ability to "debug" the S288C background, i.e., to purge detrimental variants through selection. Allele swapping and competition assays confirmed that the identified QTL improved LCH tolerance and growth, with strains combining adaptive alleles performing over 20% better than the parental S288C. 
Finally, applying ReMaSSing to breed an LCH-tolerant yeast with a xylose-consuming strain produced recombinants with improved fermentation of xylose-enriched LCH. ReMaSSing offers a practical protocol for generating QTL mapping populations to identify adaptive alleles in tolerant strains and correct genetic defects in inferior ones. Notably, recombinant populations and clones derived from ReMaSSing outperformed both parental strains in LCH tolerance and growth. Furthermore, we applied ReMaSSing to breed strains with enhanced LCH tolerance, efficient xylose catabolism, and robust ethanol production. Together, these results demonstrate that ReMaSSing is a powerful tool for engineering industrial yeast strains that integrate desirable traits from multiple parental backgrounds.
Many software reliability growth models (SRGMs) have been proposed by researchers within the context of probability theory to estimate software reliability, remaining number of faults and optimal release time. The Fault Detection Rate (FDR) may vary because of changes in testing strategies. Due to lack of knowledge of software code, the testing team might be unable to rectify the detected faults thereby introducing new faults during the fault correction process. The debugging process is imperfect due to factors like human error, insufficient testing and complex codes resulting in epistemic uncertainty. In this paper, we have proposed a new software belief reliability growth model (SBRGM) using uncertain differential equations to deal with epistemic uncertainty effectively. We have incorporated imperfect debugging and change point based on the approach of belief reliability theory, making this model more accurate as compared to some of the previously developed models. Model parameters estimation methodology is derived using the least square method and Python version 3.10. Calculation of change point is done using empirical data analysis based on the First principle of Derivatives. Three real data sets have been used to validate the proposed model. This research contributes to being more flexible and realistic in dealing with epistemic uncertainty effectively as compared to conventional approaches.
In recent years, generative artificial intelligence (GenAI) tools such as ChatGPT have been increasingly integrated into academic reading in higher education. Although GenAI can support processing complex academic texts, its effective use requires learners to employ metacognitive strategies to avoid uncritical reliance. However, how second language (L2) learners use such strategies in GenAI-supported academic reading remains underexplored. Situated in UK higher education, this qualitative study examines how 12 postgraduate L2 students employ metacognitive strategies when using ChatGPT for English academic reading. Data from interviews and retrospective reflections were thematically analyzed, while chat logs were used as supplementary descriptive evidence. The findings identify five categories of metacognitive strategies, namely planning, monitoring, evaluating, information management, and debugging. While many strategies align with prior academic reading research, others are specific to the GenAI context, particularly debugging practices such as correcting GenAI errors and developing personalized prompt templates. Differences were also observed across learners with varying language proficiency, especially in verification and prompt refinement behaviors. This study contributes by providing a qualitative account of metacognitive regulation in GenAI-supported academic reading and extending metacognitive strategy frameworks to GenAI-mediated learning environments.
Introduction Understanding the three-dimensional (3D) geometric relationship between extraocular muscles and the globe is essential for strabismus management. Conventional educational tools are static, and existing 3D biomechanical software requires highly specialized skills, making routine clinical use difficult. Furthermore, image-generating artificial intelligence (AI) frequently produces anatomically incorrect outputs (hallucinations). This study aimed to develop a structurally coherent, interactive 3D eye movement schematic as a proof-of-concept, using the coding capabilities of a large language model (LLM). Methods We used an LLM to generate web-based 3D schematic code (HTML and JavaScript/Three.js) exclusively through natural language dialogue. To prevent anatomical errors, we explicitly defined anatomical parameters based on standard literature (e.g., 12-mm scleral radius) and employed mathematical constraints, including quaternions for rotation and spherical linear interpolation for muscle paths, within the prompts. The generated code was rendered in a web browser, and an iterative process of prompt refinement and debugging was conducted until two board-certified ophthalmologists confirmed the schematic's structural validity. Results A functional, interactive 3D eye movement schematic was successfully developed. In our 10-trial evaluation, generating an acceptable schematic required an average of 7.4 prompt inputs per session. While complex spatial instructions had a lower success rate (40%) due to AI hallucinations, iterative prompt repetition and specific local debugging instructions resolved these issues. The final schematic provided a structurally coherent representation of the globe, the four rectus muscles, and the annulus of Zinn. It featured a slider interface enabling real-time, kinematic visualizations of eye rotations, muscle deformations, and optic nerve bending without structural failure. 
Conclusions Translating anatomical descriptions into mathematical spatial logic via LLMs enables the creation of structurally sound 3D medical schematics. This logical spatial construction approach democratizes the development of interactive educational tools. It allows healthcare providers without programming expertise to intuitively generate customizable 3D educational materials for patient consultations and foundational medical education through natural language dialogue.
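The two mathematical constraints named in the Methods, quaternion rotation and spherical linear interpolation, can be sketched in a few lines. This is an illustrative sketch only, not the prompts or the Three.js code from the study: the function names, the sample points, and the use of Python instead of JavaScript are assumptions. Only the 12-mm scleral radius comes from the abstract.

```python
import math

RADIUS_MM = 12.0  # scleral radius quoted in the abstract

def slerp(p0, p1, t):
    """Interpolate along the great-circle arc between two unit vectors.

    Antipodal inputs (angle of 180 degrees) are not handled in this sketch.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p0, p1))))
    theta = math.acos(dot)
    if theta < 1e-9:  # the points coincide
        return tuple(p0)
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(p0, p1))

def rotate(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    # v' = v + 2w (u x v) + 2 u x (u x v), with u = (x, y, z)
    uv = (y * v[2] - z * v[1], z * v[0] - x * v[2], x * v[1] - y * v[0])
    uuv = (y * uv[2] - z * uv[1], z * uv[0] - x * uv[2], x * uv[1] - y * uv[0])
    return tuple(v[i] + 2.0 * (w * uv[i] + uuv[i]) for i in range(3))

def muscle_path(start, end, samples=5):
    """Sample a path that stays on the globe surface between two unit vectors."""
    return [tuple(RADIUS_MM * c for c in slerp(start, end, i / (samples - 1)))
            for i in range(samples)]

# Hypothetical unit vectors standing in for a muscle's two endpoints on the globe.
path = muscle_path((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))

# A 90-degree rotation about the z axis, expressed as a unit quaternion.
half = math.radians(90.0) / 2.0
rotated = rotate((math.cos(half), 0.0, 0.0, math.sin(half)), (1.0, 0.0, 0.0))
```

Because every sampled point is a scaled unit vector, the path cannot cut through the globe. That is the kind of structural guarantee the constraint is meant to give.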
The Hefei Light Source-II (HLS-II) facility is undergoing upgrades to its beamline and endstation to enhance its synchrotron radiation capabilities. For the X-ray Magnetic Circular Dichroism endstation, an experimental control system leveraging Bluesky and Experimental Physics and Industrial Control System has been developed. This control system implements three distinct scan modes: step scan, fly scan and soft fly scan. The system architecture employs Queue Server for automated experiment orchestration and Ophyd based device abstraction, enabling seamless adaptation across multiple HLS-II beamlines. During software development, three key system features are included: (1) a simulation debugging environment combining a virtual server platform, virtual devices and virtual input/output controllers for closed-loop testing; (2) unified control interfaces through component abstraction; and (3) real-time data acquisition with noise-reduction methods. The implemented soft fly scan mode demonstrates particular advantages in experiments, achieving a tenfold reduction in measurement duration while maintaining signal-to-noise ratio levels comparable to conventional step scan, significantly enhancing both experimental efficiency and data quality for the user.
Access to hands-on PLC training is often limited by the cost and complexity of physical automation laboratories, while existing simulation tools typically lack alignment with real hardware configurations, reducing their effectiveness for education. To address this, we present a modular PLC simulation method that enables accurate virtual replication of a cost-effective, scalable industrial automation laboratory used for traffic light control, elevator operation, and automated filling systems. Built in Unity, the method integrates a custom ladder logic execution engine with interactive 3D models that mirror the exact input/output structure and operational behavior of the physical laboratory. Users can program, test, and debug logic in a realistic environment and receive immediate visual feedback, without requiring hardware. The method was validated by comparing its outputs against the physical system across 4 representative automation tasks; in every case, the virtual and physical setups produced matching control outcomes and I/O sequences that agreed within ±10 ms, confirming functional equivalence.
• Introduces a modular simulation framework that faithfully replicates the application scope of a physical low-cost PLC training laboratory.
• Combines a custom ladder logic interpreter with real-time 3D visualization in Unity to enable program testing and debugging.
• Validates functional equivalence through direct behavioral comparison with physical hardware across 4 standard automation tasks.
Background: Intimate partner violence (IPV) represents a major public health problem in Europe, with significant physical, psychological, and social consequences. Nurses are often the first professionals capable of detecting early signs of IPV, yet they lack validated instruments to assess their clinical competency in detection, evaluation, documentation, and intervention. This study aimed to develop and validate the Intimate Partner Violence Nursing Competency Scale (IPVNCS), aligned with the Nursing Intervention Classification (NIC 6403). Methods: A cross-sectional psychometric study was conducted among registered nurses in the Community of Madrid. A 30-item Likert-type self-administered instrument (1-5 scale) was developed based on NANDA, NIC 6403, and NOC frameworks. A total of 202 nurses participated. Reliability was assessed through Cronbach's alpha. Construct validity was examined using exploratory factor analysis (EFA) with Promax rotation and confirmatory factor analysis (CFA) using AMOS 26. Ethical approval was obtained (CEU San Pablo, code 843/24/104). Results: After item refinement, 26 items remained across four dimensions: (1) Intervention and Referral, (2) Detection and Assessment, (3) Documentation and Record-keeping, (4) Psychosocial Support. The instrument showed excellent reliability (α = 0.97). KMO was 0.947 and Bartlett's test was significant (p < 0.001). CFA demonstrated satisfactory fit: χ2/df = 2.066, RMSEA = 0.073, CFI = 0.92, TLI = 0.91, NFI = 0.86. The final model adequately represented the latent structure. After debugging, its psychometric properties were significantly improved. Four redundant items were eliminated, achieving internal consistency (α = 0.97), a KMO value of 0.947 and a significant Bartlett's test of sphericity. It showed a better fit: χ2/df = 2.066; Parsimony = 720.736; RMR = 0.0529; RMSEA = 0.073; NFI = 0.860; TLI = 0.910; CFI = 0.920. 
The final model provides an adequate representation of the latent structure of the data. This study provides initial evidence of construct validity and internal consistency reliability of the IPVNCS. Conclusions: The IPVNCS is a valid and reliable tool to assess nursing competencies for clinical management of IPV. It supports structured evaluation across four core nursing domains, enabling improved educational planning, clinical decision-making, and quality of care for victims. The scale fills a gap in clinical nursing assessment tools and can support protocol development in emergency, primary care, and hospital settings.
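Cronbach's alpha, the internal-consistency statistic reported above, is computed from item variances and the variance of the total score. The sketch below shows the standard formula on made-up data. The function name and the sample responses are hypothetical, and it does not reproduce the study's analysis, which used dedicated statistical software.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

# Hypothetical 1-5 Likert responses: 5 respondents, 4 items.
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
```

Items that rise and fall together across respondents push alpha toward 1, which is why removing redundant or poorly fitting items changes the reported value.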
Code coverage-guided unit test generation (CGTG) and large language model-based test generation (LLMTG) are two principal approaches for the generation of unit tests. Each of these approaches has its inherent advantages and drawbacks. Tests generated by CGTG have been shown to exhibit high code coverage and high executability. However, they lack the capacity to comprehend code intent, which results in an inability to identify deviations between code implementation and design intent (i.e., functional defects). Conversely, although LLMTG demonstrates an advantage in terms of code intent analysis, it is generally characterized by low executability and necessitates iterative debugging. In order to enhance the ability of unit test generation to identify functional defects, a novel framework has been proposed, entitled the intent analysis-guided unit test generation and refinement (IGTG&R) model. The IGTG&R model consists of a two-stage process for test generation. In the first stage, we introduce coverage path entropy to enhance CGTG to achieve high executability and code coverage of test cases. The second stage refines the test cases using LLMs to identify functional defects. We quantify and verify the interference of incorrect code implementation on intent analysis through conditional entropy. In order to reduce this interference, the focal method body is excluded from the code context information during intent analysis. Using this two-stage process, IGTG&R achieves a more profound comprehension of the intent of the code and better identification of functional defects. The IGTG&R model has been demonstrated to achieve an identification rate of functional defects ranging from 65% to 89%, with an execution success rate of 100% and a code coverage rate of 75.8%. This indicates that IGTG&R is superior to the CGTG and LLMTG approaches in multiple aspects.
The Taishan Antineutrino Observatory (TAO) is a high-energy resolution reactor antineutrino experiment designed to measure the fine structure of the reactor antineutrino energy spectrum. It employs silicon photomultipliers (SiPMs) to detect photons produced by secondary particles from antineutrino interactions in a gadolinium-doped liquid scintillator. The physics event rate of the TAO is ∼520 Hz. However, the use of 4024 SiPM arrays results in a high dark noise event rate, leading to a total event rate of up to 1 GHz. This presents a significant challenge in the trigger system design: how to accurately and efficiently select rare effective physics events in real time amidst a vast amount of noise. This paper introduces a fully digital hardware trigger system. The system features a flexible, reconfigurable two-level processing architecture, combined with a real-time triggering algorithm based on the multiplicity trigger criterion. The trigger system has been tested with simulation data, and a preliminary joint test with the detector system has been completed. The results of the simulation test with a single module suggest that the trigger system can accurately extract the 1 kHz simulation physics events from the substantial amount of dark noise and upload the triggered data to the DAQ system. In addition, in the preliminary joint test, the trigger system accurately extracted the given effective physics event data while compressing the hit rate of dark noise from 2 MHz to 500 Hz. The trigger system has been successfully installed and deployed at the TAO experimental site. It has undergone integrated debugging with the full-scale detector and Front-End Electronics (FEC), and preliminary data acquisition tests have been completed. The design objectives of the triggering system have been fulfilled, demonstrating its correctness and reliability in practical application scenarios.
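A multiplicity trigger fires when enough distinct channels register hits within a short coincidence window, which suppresses uncorrelated dark noise. The following is a software sketch of that general idea only; the actual TAO system is digital hardware with a two-level architecture whose details the abstract does not give. The function name, window length, threshold, and hit list are all hypothetical.

```python
from collections import deque

def multiplicity_trigger(hits, window_ns, threshold):
    """Return the hit times at which a multiplicity condition is met.

    `hits` is a time-sorted list of (time_ns, channel) pairs. A trigger is
    recorded whenever at least `threshold` distinct channels have a hit
    within the last `window_ns` nanoseconds (window_ns must be positive).
    """
    window = deque()   # hits currently inside the coincidence window
    counts = {}        # channel -> number of hits inside the window
    triggers = []
    for t, ch in hits:
        window.append((t, ch))
        counts[ch] = counts.get(ch, 0) + 1
        # Drop hits that have fallen out of the window.
        while window[0][0] <= t - window_ns:
            _, old = window.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        if len(counts) >= threshold:
            triggers.append(t)
    return triggers

# Hypothetical stream: isolated noise hits plus one burst on channels 3, 4, 5.
hits = [(0, 1), (500, 2), (1000, 3), (1010, 4), (1020, 5), (1030, 3), (2000, 6)]
fired = multiplicity_trigger(hits, window_ns=100, threshold=3)
```

Isolated hits never reach the threshold, so only the correlated burst produces triggers. A real system would also merge consecutive triggers into a single event.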
An efficient software framework for analyzing driver behavior from videos is presented. The system automatically detects driver activities, such as phone use, identifies glance patterns, and determines the vehicle’s automation state, achieving over 96% accuracy for glance and mode detection. For transportation and safety researchers, vehicle engineers, and fleet safety managers, this tool drastically reduces the time and cost of manual video analysis by over 90%. This efficiency enables large-scale naturalistic driving studies, the objective quantification of driver distraction and inattention in commercial and passenger fleets, and the safety evaluation of new in-vehicle information systems and advanced driver-assistance systems. The framework provides an accessible method for understanding real-world driver behavior, ultimately helping to design safer vehicle systems and inform evidence-based road safety policies. TECHNICAL ABSTRACT Background The increasing prevalence of Level 2 (L2) automated driving systems introduces complex human-machine interactions, making the analysis of driver behavior critical for safety. Traditional research methods rely on manual video annotation, a time-consuming and resource-intensive process that limits the scale of naturalistic driving studies and creates a bottleneck for research. Purpose This study aimed to develop and validate a robust, open-source, and integrated framework for the efficient, multi-faceted analysis of driver behavior using standard in-vehicle video recordings, thereby making large-scale analysis more accessible. Methods The framework integrates several pre-trained computer vision models. It uses YOLOEv8 for object detection and MediaPipe for skeletal and facial landmark tracking to classify physical secondary tasks (e.g., phone use, consuming). Driver attentional state is determined by classifying glance zones based on head pose and eye gaze estimation. Driving mode (Manual vs. 
Automated) is detected using GPU-accelerated OpenCV to analyze dashboard iconography. The entire process is managed through a custom graphical user interface with comprehensive debugging tools. Results Validation against manually annotated data confirmed the framework’s high accuracy. Activity detection achieved F1-scores of 93.3% for consuming, 90.9% for browsing on a phone, and 87.5% for talking on a phone. Driving mode and glance allocation detection achieved mean accuracies of 97.4% and 96.5%, respectively. Conclusion The proposed framework is a validated, efficient, and replicable alternative to manual coding. By significantly reducing analytical workload, it provides an accessible tool for conducting scalable research into driver distraction, attention, and human-automation interaction.
Mental health stigma (MHS) presents a significant barrier to help-seeking and adversely affects the quality of life and support for adolescents with mental health difficulties, yet culturally adapted assessment tools for adolescents in China remain scarce. The objectives were to translate the Peer Mental Health Stigmatization Scale (PMHSS) into Chinese and evaluate its psychometric properties, including reliability and criterion validity. The Chinese version of PMHSS (C-PMHSS) was developed through forward and backward translation, synthesis, comparison and cross-cultural debugging. Psychometric properties were evaluated in a stratified cluster sample of 530 adolescents (13-18 years). Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) assessed structural validity, while reliability was tested through Cronbach's alpha and test-retest correlations. Factor analyses confirmed a two-factor negative subscale (57.74% variance; χ²/df = 2.303, RMSEA = 0.071) and a three-factor positive subscale (70.32% variance; χ²/df = 2.143, RMSEA = 0.066). The C-PMHSS demonstrated strong internal consistency (α = 0.83) and test-retest reliability (r = 0.86). Significant MHS variations emerged across age, grade, and sex (p < 0.001). The C-PMHSS demonstrates robust psychometric properties, establishing itself as the first psychometrically validated Chinese instrument for early MHS identification among adolescents. It holds promise for the early identification of adolescents with elevated mental health stigma and for guiding tailored interventions to reduce MHS and promote adolescent mental wellbeing.
In end-of-life care, nurses’ ability to make sound ethical decisions is critical to safeguarding patients’ dignity and quality of life. However, China still lacks a measurement tool tailored to this specific context for assessing such competence. This study, therefore, aimed to localize the Nurses’ Ethical Decision-Making around End-of-Life Care Scale (NEDM-EOLCS) into Chinese and to examine its psychometric characteristics among Chinese nurses. A cross-sectional study was conducted between October 2024 and December 2024. The Chinese version of the NEDM-EOLCS was initially developed using the Brislin translation model, cross-cultural debugging, and a pre-survey tailored to Chinese linguistic and cultural contexts. A total of 450 nurses completed the Chinese version of the NEDM-EOLCS. Exploratory factor analysis (EFA) was applied to the data from Group 1 (n = 225) to elucidate the factor structure, whereas Group 2 data (n = 225) were subjected to confirmatory factor analysis (CFA) to verify the model’s suitability; additionally, convergent validity, discriminant validity, and reliability tests were carried out. The Scale-level Content Validity Index (S-CVI) was 0.98; EFA extracted three factors and explained 61.816% of the total variance; CFA confirmed that all the goodness-of-fit indices were acceptable. The Cronbach’s alpha of the Chinese version of the NEDM-EOLCS was 0.962, and the retest reliability coefficient was 0.896. The Chinese version of the NEDM-EOLCS had 55 items in 3 dimensions. The Chinese version of the NEDM-EOLCS is scientifically reasonable and has good reliability and validity. It can be used to investigate Chinese nurses’ ethical decision-making around end-of-life care.
In the gig economy dominated by algorithmic control, online labor platforms tend to reduce work resources while increasing job demands, leading gig workers to face weakened agency and declining job satisfaction. This study integrates adaptive structuration theory and job crafting theory to propose the concept of adaptive job recrafting for gig workers who perceived algorithmic control, systematically exploring its connotations and impact mechanisms through a multi-stage mixed-methods research design. First, the grounded theory method is applied to construct a four-dimensional model of adaptive job recrafting comprising algorithm task debugging, socio-technical mutual construction, collaborative skill expansion, and identity cognition evolution. Subsequently, a 19-item measurement instrument was developed following rigorous scale development procedures, with reliability and validity confirmed through empirical testing. Finally, based on the Job Demands-Resources model, a longitudinal survey was conducted to empirically test the positive impact mechanism of adaptive job recrafting on job satisfaction, revealing the partial mediating role of work engagement and the moderating role of psychological capital. This study transcends the limitations of traditional job crafting theory by unveiling the adaptive mechanisms of human-technology interaction under algorithmic control. The practical implications provide evidence for platforms to optimize algorithmic design (such as reserving crafting spaces and incorporating performance evaluation) and for gig workers to construct systematic crafting strategies, helping to achieve a dynamic balance between control and empowerment.
Histopathological tissue reveals natural radial and bilateral symmetry in glandular structures, which becomes progressively disrupted during malignant transformation. Leveraging this observation, this work presents a VGG16-based deep learning model enriched with symmetry-aware interpretation for early detection of Colon Adenocarcinoma. Traditional approaches are not transparent enough and act as "black boxes", which diminishes their clinical adoption and acceptance in real-world scenarios. This work builds on recent breakthroughs in deep learning for medical imaging and integrates Explainable AI (XAI) strategies such as LIME, SHAP, and Grad-CAM into the model to interpret how cancer-induced symmetry distortions influence model decisions. The model is evaluated on a balanced dataset of 10,000 histopathological scans, including 5,000 Colon Adenocarcinoma tissue samples and 5,000 Benign Colon Tissue samples. This research aims to shed light on how benign tissues preserve consistent symmetric glandular patterns, while cancerous samples exhibit pronounced asymmetry, irregular boundaries, and disrupted structural repetition. The authors further aim to quantify these differences using lightweight 2D symmetry indices, demonstrating a clear separation between normal and malignant tissues. The work presents a highly precise model for the diagnosis of colon cancer using a VGG16 CNN that achieves an encouraging test accuracy of 99.85%. The model exhibited very high precision, recall, and F1-scores for both classes, normal and cancer, as demonstrated by the classification report. Among the XAI techniques, Grad-CAM demonstrated speed and scalability, making it an appropriate choice for large-scale deployment in healthcare. SHAP, though computationally costly, offered theoretical robustness and great insight. LIME was handy for local interpretability and especially convenient for debugging individual predictions.
Postpartum depression (PPD) affects 10-15% of individuals annually, yet early identification and treatment remains challenging. We introduce ClinPreAI, a novel agentic AI system that autonomously designs, implements, and evaluates machine learning solutions for PPD risk prediction using multimodal electronic health record data. We analyzed data from 4,161 pregnant individuals hospitalized prior to delivery for medical or obstetrical complications at Texas Children's Hospital (2012-2025), extracting 27 structured clinical variables and social worker notes. The primary outcome was Edinburgh Postnatal Depression Scale (EPDS) score ≥10 (31.0% prevalence) within 6 months after delivery, indicating clinically significant depressive symptoms. ClinPreAI operates through five specialized modules that iteratively refine predictive models through autonomous experimentation. ClinPreAI demonstrated strong performance across modalities. On structured data, it achieved F1: 0.68 ± 0.03, outperforming traditional AutoML (F1: 0.64 ± 0.02) and commercial solutions (AWS Canvas F1: 0.54-0.55). On multimodal data, ClinPreAI achieved F1: 0.65 ± 0.04, matching custom LLM-XGBoost (F1: 0.65 ± 0.01) and outperforming zero-shot models (Claude Opus F1: 0.51-0.52). This represents the first application of agentic AI to perinatal mental health prediction. Our results demonstrate that autonomous AI agents can democratize sophisticated predictive modeling in clinical settings, which is particularly valuable where domain experts lack ML training. By automating experimentation and debugging, agentic systems lower barriers to developing robust clinical prediction tools while maintaining interpretability.
Coding agents powered by large language models (LLMs) have gained traction for automating code generation through iterative problem-solving with minimal human involvement. Despite the emergence of various frameworks, e.g., LangChain, AutoML, and AIDE, ML scientists still struggle to effectively review and adjust the agents' coding process. The current approach of manually inspecting individual outputs is inefficient, making it difficult to track code evolution, compare coding iterations, and identify improvement opportunities. To address this challenge, we introduce a visual analytics system designed to enhance the examination of coding agent behaviors. Focusing on the AIDE framework, our system supports comparative analysis across three levels: (1) Code-Level Analysis, which reveals how the agent debugs and refines its code over iterations; (2) Process-Level Analysis, which contrasts different solution-seeking processes explored by the agent; and (3) LLM-Level Analysis, which highlights variations in coding behavior across different LLMs. By integrating these perspectives, our system enables ML scientists to gain a structured understanding of agent behaviors, facilitating more effective debugging and prompt engineering. Through case studies using coding agents to tackle popular Kaggle competitions, we demonstrate how our system provides valuable insights into the iterative coding process.
Cable-driven parallel robots (CDPRs) are attractive for large-space manipulation because of their lightweight structure, large workspace, and reconfigurability. However, existing systems still face three practical challenges: limited modularity of the mechanical architecture, repeated calibration after reconfiguration, and insufficient integration between visual perception and grasp execution. To address these issues, this paper presents a modular cable-driven parallel robot (MCDPR), together with its kinematic modeling, vision-based self-calibration, and visual grasping methods. First, a modular mechanical architecture is developed in which the drive, sensing, and cable-guiding functions are integrated to support rapid assembly/disassembly, convenient debugging, and cable anti-slack operation. Second, a pulley-considered multilayer kinematic model is established, and a vision-based self-calibration method is proposed to identify the structural parameters after assembly using onboard sensing and AprilTag observations, thereby reducing the number of recalibrations required during robot operation after reconfiguration. Third, a vision-guided bin-picking method is developed by combining RGB-D perception, coordinate transformation, and the calibrated robot model. Simulation and prototype experiments are conducted to validate the proposed system. A software/hardware combined validation framework is established, in which the CoppeliaSim-based simulation and the hardware prototype are used together to verify the proposed design and methods. In simulation, self-calibration reduces the Euclidean grasping position error from 0.371 mm to 0.048 mm and the orientation error from 0.071° to 0.004°. In experiments, the relative position error is reduced by 58.33% after self-calibration.
This work proposes a semantic ontology-based dataset for fine-tuning a large language model to facilitate JavaScript debugging and domain-specific code generation. The ontology is used to train the model with a dataset that captures exact or logical relationships between JavaScript syntax elements. The system gains deep subject knowledge with the help of a formal linked database, producing a high-quality Q&A dataset from it and employing parameter-efficient fine-tuning of a base LLM (LLaMA-3B). The fine-tuned model is assessed through a strict framework for domain competency. Code correctness, logical consistency, adaptability, and error detection efficiency metrics were used for evaluation. Experimental results show that the ontology-augmented model performs much better across all measures than the baseline generic LLaMA model. Baseline here also refers to a model refined on non-ontology data and to retrieval-based techniques. Logical verification and comparisons of fine-tuning techniques (BitFit, LoRA, and standard tuning) are provided. For performance contextualization, an additional benchmark against a cutting-edge code model (CodeLlama) is provided. The enhanced outcomes show that using ontologies to incorporate structured semantic knowledge can result in significant improvements in domain-specific code comprehension, providing a repeatable route for creating specialized programming AI systems. For reproducibility, the implementation and resources (ontology, SPARQL queries, code) are made publicly accessible.