Nitrogen fixation in oxygenic cyanobacteria depends on a system of genes that protect oxygen-sensitive nitrogenase, many of which likely remain uncharacterized. Here we predict FOX (fixation in the presence of oxygen) gene candidates in Anabaena sp. PCC 7120 by integrating nitrogen step-down RNA-seq (0/6/12/21 hours), quantitative proteomics, promoter architecture, genomic context, and reciprocal-best-hit conservation across diazotrophic and non-diazotrophic cyanobacteria. Using 68 literature-validated FOX genes and 835 conserved non-essential genes as a proxy negative class, we trained logistic regression, Random Forest, and XGBoost models and evaluated them using 20 repeated stratified 80/20 train–test splits. The best models achieved ROC–AUC up to 0.80 and average precision up to 0.55, and precision among the top 20 ranked genes reached 0.39 versus a prevalence baseline of 0.075. Model interpretation highlights late step-down induction, diazotroph-biased conservation, and genomic neighborhood signals as leading predictors. We generated genome-wide FOX probability scores, used primarily for candidate ranking, that nominate conserved genes spanning heterocyst envelope processes as well as broader redox, metabolism, and electron-pool regulation. We release these predictions and a public web-based optimizer that applies comparative-bioinformatics filters and size constraints to propose candidate accessory-gene complements for experimental testing and heterologous reconstitution efforts.
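The top-20 precision and the prevalence baseline quoted above can be related with a short calculation. The following is a minimal sketch, not the authors' pipeline: the function names and the toy scores are illustrative assumptions, and only the counts 68 and 835 and the reported precision of 0.39 come from the abstract.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positives (label 1) among the k highest-scoring items."""
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of items")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def prevalence(n_positive, n_negative):
    """Expected precision of a random ranking: the share of positives."""
    return n_positive / (n_positive + n_negative)


# Counts from the abstract: 68 validated FOX genes, 835 proxy negatives.
baseline = prevalence(68, 835)      # 68 / 903
# Reported top-20 precision divided by the baseline gives a fold enrichment.
enrichment = 0.39 / baseline

# Toy ranking (hypothetical scores, not real gene data).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_p_at_2 = precision_at_k(toy_scores, toy_labels, k=2)

print(f"baseline={baseline:.3f} enrichment={enrichment:.1f}x toy P@2={toy_p_at_2}")
```

Under these assumptions, a top-20 precision of 0.39 corresponds to roughly a five-fold enrichment over random ranking.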
Transcription factors (TFs) and their target genes form regulatory networks that control gene expression and influence diverse biological processes and disease outcomes. Although multiple computational methods and curated databases have been developed to identify TF-target interactions, they often require specialized expertise. Large language model (LLM) chatbots offer a more accessible alternative for querying TF-target interactions. In this study, we benchmarked four prominent LLMs, Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.0 Pro, OpenAI's GPT-4o, and Meta's Llama3 8b, using 8432 literature-curated human TF-target interactions. We examined four regulatory categories: bidirectional, ambiguous, self-regulated, and unidirectional interactions. Under single-turn queries, Claude 3.5 Sonnet and GPT-4o outperformed the others, with balanced accuracies reaching 50.0 ± 7.6% (GPT-4o, self-regulated) and 48.2 ± 1.0% (Claude 3.5 Sonnet, unidirectional). Zero-temperature settings generally enhanced reproducibility, and multi-turn prompting improved performance for most models, increasing Claude 3.5 Sonnet's accuracy on self-regulated pairs by 32.6%. Excluding TF-target pairs whose regulation types were all unknown also generally improved accuracy, with unidirectional regulation reaching nearly 70% balanced accuracy in some cases. We also benchmarked Anthropic's Claude 3.5 Sonnet, Google's Gemini 2.0 Flash, OpenAI's GPT-4o, and Meta's Llama3 using 5148 experimentally derived TF-target interactions. Claude 3.5 Sonnet consistently outperformed the other models across conditions. Our findings highlight that prompt engineering and strategic use of model parameters consistently influence LLM chatbots' performance on TF-target identification. This study establishes a benchmarking framework and demonstrates the potential of pre-trained general-purpose LLMs to support regulatory biology research, especially for researchers without extensive computational expertise.
The literature-based TF-target interaction ground truth was obtained from the TRRUST v2 human dataset (www.grnpedia.org/trrust). The experimentally derived TF-target interaction ground truth was obtained from the TFLink Homo sapiens small-scale interaction table (https://tflink.net/). The processed TF-target interaction data and the analytical pipeline have been compiled as an interactive Python notebook, available at https://github.com/pengpclab/LLM-TF-interactions.
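Balanced accuracy, the metric reported for the TF-target benchmark, is the unweighted mean of per-class recall, which keeps a frequent regulation type from dominating the score. The following is a minimal sketch assuming each prediction is a single class label; the label strings and toy data are illustrative assumptions and are not taken from the study's notebook.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    total = defaultdict(int)    # number of true items per class
    correct = defaultdict(int)  # number of correctly predicted items per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[cls] / total[cls] for cls in total]
    return sum(recalls) / len(recalls)


# Toy example with hypothetical regulation-type labels.
truth = ["Activation"] * 4 + ["Repression"] * 2
preds = ["Activation", "Activation", "Activation", "Repression",
         "Repression", "Activation"]
score = balanced_accuracy(truth, preds)  # recalls are 3/4 and 1/2
print(score)
```

In this toy case plain accuracy would be 4/6, while balanced accuracy weights the two classes equally.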
Large language models (LLMs) show promise for guiding appropriate diagnostic imaging modality selection according to ACR criteria. This study compared seven LLMs, namely OpenEvidence (OpenEvidence Inc., Miami, Florida); OpenAI's GPT-5 Thinking and GPT-5 (OpenAI, San Francisco, California); Anthropic's Opus 4.1 and Sonnet 4.5 (Anthropic, San Francisco, California); and Google's Gemini 2.5 Pro and 2.5 Flash (Google LLC, Mountain View, California), using 50 clinical vignettes to assess accuracy and clinical reasoning in formulating imaging modality recommendations. Fifty text-based clinical vignettes were created from ACR guidelines, featuring five variants of 10 different medical complaints with subtle symptomatic or demographic alterations. A 3-point Likert scale was used to evaluate four performance metrics: imaging appropriateness, technical specificity, clinical rationale strength, and citation quality. Readability and word count were also assessed. Two blinded, independent reviewers rated the LLM outputs, with discrepancies resolved via consensus. A third reviewer was included for persistent disagreements. Analysis involved Friedman's test followed by pairwise Wilcoxon signed-rank testing with Holm correction (P < .05). Friedman testing demonstrated significant differences across all performance domains (P ≤ .031). Appropriateness scores (range 1.60-1.88 out of 2.00) revealed no significant pairwise differences. Technical specificity (range 1.82-2.00) and clinical rationale (range 1.52-1.88) showed no significant pairwise differences. Citation quality (range 0.40-2.00) was the most variable; Gemini 2.5 Pro and Gemini 2.5 Flash hallucinated citations in 80% and 76% of prompts, respectively, performing worse than all other models (P < .001). Readability scores ranged from 15.27 to 22.19, and word counts from 90.10 to 195.02. All LLMs selected appropriate imaging modalities using reasonable clinical justification. Citation validity varied widely.
Ensuring congruence between clinical reasoning and cited sources is essential before successful implementation.
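The Holm correction named in the analysis above adjusts a family of pairwise P values with a step-down rule. The following is a minimal sketch of that rule only; the P values are invented for illustration, and the count of 21 comparisons assumes that every pair of the seven models was tested, which the abstract does not state.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Seven models give 7 * 6 / 2 = 21 pairs if all pairs are compared (assumption).
n_models = 7
n_pairs = n_models * (n_models - 1) // 2

# Hypothetical raw P values for three comparisons.
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
print(n_pairs, adj)
```

A comparison is then called significant when its adjusted value falls below the chosen threshold (P < .05 in the abstract).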
Recent advancements in large language models (LLMs) have accelerated their integration into clinical domains, including laboratory medicine. The performance of LLMs in answering board-level laboratory medicine questions has not been comprehensively evaluated. Given the importance of diagnostic accuracy in this field, rigorous and objective evaluations of LLM capabilities are essential. We assessed 12 LLMs from OpenAI, Anthropic, and Google using 320 Korean Residency Examination questions (2021-2024) spanning six laboratory medicine subspecialties. Standardized prompts were provided via their application programming interfaces under deterministic settings (temperature=0). Questions were administered thrice to assess response reproducibility. Outputs were compared with validated answers and analyzed for accuracy, reasoning quality, and error typology. Google's Gemini 2.0 Pro achieved the highest accuracy (80.0%), followed by OpenAI's GPT-4.5 (77.2%) and Anthropic's Claude 3.7 Sonnet (74.1%). Accuracy decreased as the difficulty of questions increased (78.0% for easy vs. 45.1% for challenging). Subspecialty performance varied. All models underperformed on transfusion medicine questions (mean accuracy: 38.8%), primarily because of limitations in domain-specific and regional knowledge representations. Incorrect answers primarily resulted from reasoning errors. Reproducibility exceeded 95% for most models; however, some residual non-determinism appeared even with greedy decoding (temperature=0). LLMs demonstrated substantial potential for integration into laboratory medicine, particularly in clinical chemistry and immunology. Performance inconsistencies (particularly for high-difficulty questions) and knowledge gaps (notably for transfusion medicine) highlight the necessity for further development (potentially including domain-specific fine-tuning and retrieval-augmented generation integration) and robust expert oversight before clinical application.
Trigeminal neuralgia (TN) caused by vertebrobasilar dolichoectasia (VBD) is a rare but particularly challenging entity. Microvascular decompression (MVD) is considered the most definitive treatment; however, outcomes in this subgroup remain incompletely characterized. We conducted a systematic review and meta-analysis following PRISMA guidelines. PubMed, Embase, Scopus, and Web of Science were searched from inception through August 22, 2025. Eligible studies reported on patients with VBD-TN undergoing MVD with extractable data on pain outcomes, recurrence, salvage interventions, or complications. Complete relief was defined as Barrow Neurological Institute (BNI)-I, while adequate relief included BNI-I to IIIb. Thirteen studies involving 315 patients were analyzed. The mean age ranged from 54.0 to 67.3 years, and 57.8% (182/315) of patients were male. The pooled initial complete pain relief rate was 95.8% (95% CI, 92.3-98.2%), with sustained relief at the last follow-up in 92.6% (95% CI, 88.4-96.1%). Adequate relief was nearly universal, at 99.9% (95% CI, 98.2-100%) initially and 95.9% (95% CI, 91.8-98.8%) at the last follow-up. Pain recurrence occurred in 5.5% (95% CI, 2.9-8.9%), and salvage procedures were required in 1.3% (95% CI, 0.2-3.1%). Permanent morbidity was low, at 2.4% (95% CI, 0.8-4.8%). Meta-regression indicated that prior ablative procedures were associated with higher complication rates, whereas V2 involvement predicted better long-term pain control. MVD appears to provide effective and durable pain relief for selected patients with VBD-TN, with low permanent morbidity but a clinically meaningful overall complication burden. Given the retrospective nature of the available evidence, MVD should be considered a promising treatment option rather than a definitive standard of care.
The author examined whether a large language model (LLM) can help identify noncompliance with the Mental Health Parity and Addiction Equity Act (MHPAEA) in health insurance plan documents. Using Anthropic's Claude 3.5 Sonnet between December 1, 2024, and January 31, 2025, the author analyzed primary documentation for the Essential Health Benefits benchmark plans for 2026. An LLM prompt was first validated, and the author assessed the LLM's positive predictive value (PPV) in applying that prompt to identify areas of potential MHPAEA noncompliance. The LLM then prioritized the top 10 areas of noncompliance among those accurately identified. The LLM identified on average 3.8 areas of potential noncompliance per document, with an average PPV of 49%. The findings indicate that LLMs currently have a relatively poor PPV in regulatory oversight tasks but may help improve efficiency by enabling rapid identification of potential MHPAEA noncompliance to prioritize areas for further review.
This study compares OpenAI's GPT-4o and Anthropic's Claude 4 in the generation of formative and summative feedback in Objective Structured Clinical Examinations (OSCEs) within Qpercom's assessment platform. A stratified sample of 51 anonymized student records was analyzed, comparing examiner-facing (pre-verification/preview) and student-facing (portfolio) feedback across both models. While both systems delivered actionable suggestions, Claude 4 consistently outperformed GPT-4o in alignment with examiner data, absence of hallucinations, and preservation of critical learning points, especially for underperforming and mid-performing students. This evidence-based evaluation recommends Claude 4 as the safer and more effective AI solution for high-stakes educational settings.
Background: AI language models such as Google Gemini, OpenAI ChatGPT, and Anthropic's Claude are developing rapidly in response to the growing demand from various sectors of daily life, science, and industry. By collecting and processing extensive datasets, including medical data, they are becoming increasingly popular tools supporting not only IT specialists and programmers but also students and resident physicians in their studies and preparation for examinations, including specialization exams. Consequently, the reliability and accuracy of the information provided by these tools are often questioned. This concern formed the basis of the present study, which verified the utility of the Google Gemini 2.5 Pro model using the Polish State Specialization Examination (PES) in Pediatric Surgery. Objective: The objective of this study was to assess the effectiveness and confidence levels of the Gemini 2.5 Pro model in answering PES questions, thereby evaluating its potential educational utility in the specialized surgical field of pediatric surgery. Methods: The study was conducted using the most recent official PES from the spring 2025 session in pediatric surgery. The exam consisted of 120 multiple-choice questions (five options each, one correct answer). Based on previously published studies and the nature of the questions used in the PES across various medical disciplines in Poland, the questions were divided into two categories: clinical and general (theoretical). Before conducting the test, the Gemini 2.5 Pro model was presented with the PES regulations and then introduced to the examination paper containing the questions in Polish. The correctness of the solved test was verified against the official answer key from the Center for Medical Examinations (CEM) in Łódź. Additionally, the AI model was instructed to rate its confidence in each answer on a five-point scale (from 1 = no confidence to 5 = full confidence).
The data obtained were analyzed statistically using the chi-squared test and the Mann-Whitney U test. Results: The Google Gemini 2.5 Pro model achieved 103 correct answers, corresponding to an overall effectiveness of 85.83%, which is well above the 60% passing threshold. For subgroup analysis, the questions were divided into clinical and general categories, with the model scoring 83% and 91% correct answers, respectively. This difference was not statistically significant (p = 0.417), and the effect size (Cohen's h = 0.19) indicated a small effect. Furthermore, the model's confidence ratings showed that correct answers were generally given with higher confidence, while incorrect ones were associated with lower confidence. This suggests a positive correlation between confidence and accuracy, particularly for general questions. However, due to limited data, the exact effect size of this relationship could not be determined. Conclusions: Gemini 2.5 Pro's strong performance on the PES demonstrates the considerable potential of advanced AI models in supporting medical education, even in highly specialized fields such as pediatric surgery. The observed association between correctness and declared confidence may help users gauge the reliability of AI-generated responses. Nevertheless, high performance in an examination setting does not eliminate the need for verification and critical evaluation of AI-generated answers in real-world clinical and educational applications.
The healthcare industry faces dual revolutions with the widespread adoption of mobile healthcare applications and the emergence of generative artificial intelligence (Gen AI). The Veterans Administration and other military healthcare providers particularly stand to benefit from these technologies given their unique challenges serving veterans. This case study explored how Gen AI might help bridge the historical gap between healthcare providers and software developers in creating more effective healthcare applications for veterans. The study utilized Anthropic's Claude 3.5 Sonnet, a large language model, to assist in developing requirements for a hypothetical healthcare application, Annie Pro. The process included uploading relevant documentation into the AI's context window and conducting an unstructured interview with the AI over an approximately 6-hour period, generating 80 pages of conversational text and 23 multi-page artifacts. Eight software developers were consulted to provide informal qualitative feedback on the resulting 26-page requirements document. The Gen AI demonstrated utility in requirements gathering, technical specification development, project planning, user flow mapping, and interface design. The AI showed particular strength in rapidly incorporating new requirements and explaining technical concepts to nontechnical stakeholders. Software developers reviewing the final product universally praised its value as a starting point for development, although some expressed concern about overly prescriptive technical specifications. This study suggests that Gen AI can effectively support healthcare providers in developing software requirements. While the technology shows promise in improving provider-developer communication, careful attention must be paid to avoid false confidence and over-specification. 
Future studies should look to replicate these results across different healthcare contexts and with different AI models as the technology continues to evolve rapidly.
A few weeks ago, a colleague of mine needed to collect and format some data from a website, and he asked the latest version of Anthropic's generative AI system, Claude, for help. Claude cheerfully agreed to perform the task, generated a computer program to download the data, and handed over perfectly formatted results. The only problem? My colleague immediately noticed that the data Claude delivered was entirely fabricated.
Publicly available artificial intelligence (AI) Vision Language Models (VLMs) are constantly improving. The advent of vision capabilities on these models could enhance radiology workflows. Evaluating their performance in radiological image interpretation is vital to their potential integration into practice. This study aims to evaluate the proficiency and consistency of the publicly available VLMs, Anthropic's Claude and OpenAI's GPT, across multiple iterations in basic image interpretation tasks. Subsets from publicly available datasets, ROCOv2 and MURAv1.1, were used to evaluate 6 VLMs. A system prompt and image were input into each model three times. The outputs were compared to the dataset captions to evaluate each model's accuracy in recognising the modality, anatomy, and detecting fractures on radiographs. The consistency of the output across iterations was also analysed. Evaluation of the ROCOv2 dataset showed high accuracy in modality recognition, with some models achieving 100%. Anatomical recognition ranged between 61% and 85% accuracy across all models tested. On the MURAv1.1 dataset, Claude-3.5-Sonnet had the highest anatomical recognition with 57% accuracy, while GPT-4o had the best fracture detection with 62% accuracy. Claude-3.5-Sonnet was the most consistent model, with 83% and 92% consistency in anatomy and fracture detection, respectively. Given Claude and GPT's current accuracy and reliability, the integration of these models into clinical settings is not yet feasible. This study highlights the need for ongoing development and establishment of standardised testing techniques to ensure these models achieve reliable performance.
The demands of intensified aquaculture production and escalating disease prevalence underscore the need for efficacious probiotic strategies to enhance fish health. This study focused on isolating and characterising potential probiotics from the gut microbiota of the emerging aquaculture species jade perch (Scortum barcoo). Eighty-seven lactic acid bacteria and 149 other bacteria were isolated from the digestive tract of five adult jade perch. The screening revealed that 24 Enterococcus hirae isolates inhibited the freshwater pathogens Aeromonas sobria and Streptococcus iniae. Co-incubating E. hirae with the host gut suspensions demonstrated a two- to five-fold increase in the size of growth inhibition zones compared to the results when using gut suspensions from tilapia (a non-host), indicating host-specificity. Genome analysis of the lead isolate, E. hirae R44, predicted the presence of antimicrobial compounds like enterolysin A, class II lanthipeptide, and terpenes, which underlay its antibacterial attributes. Isolate R44 exhibited desirable probiotic characteristics, including survival at pH values within the range of 3 to 12, bile tolerance, antioxidant activity, ampicillin sensitivity, and absence of transferable antimicrobial resistance genes and virulence factors commonly associated with hospital Enterococcus strains (IS16, hylEfm, and esp). This study offers a foundation for sourcing host-adapted probiotics from underexplored aquaculture species. Characterisation of novel probiotics like E. hirae R44 can expedite the development of disease mitigation strategies to support aquaculture intensification.
Background: Large language models (LLMs) have the potential to enhance information processing and clinical reasoning in the healthcare industry but are hindered by inaccuracies and hallucinations. The retrieval-augmented generation (RAG) technique may address these problems by integrating external knowledge sources. Methods: We developed a RAG-based chatbot called Thyro-GenAI by integrating a database of textbooks and guidelines with an LLM. Thyro-GenAI and three service LLMs (OpenAI's ChatGPT-4o, Perplexity AI's ChatGPT-4o, and Anthropic's Claude 3.5 Sonnet) were asked personalized clinical questions about thyroid disease. Three thyroid specialists assessed the quality of the generated responses and references without being blinded, which allowed them to interact with different chatbot interfaces. Results: Thyro-GenAI achieved the highest inverse-weighted mean rank for overall response quality. The overall inverse-weighted mean rankings for Thyro-GenAI, ChatGPT, Perplexity, and Claude were 3.0, 2.3, 2.8, and 1.9, respectively. Thyro-GenAI also achieved the second-highest inverse-weighted mean rank for overall reference quality. The overall inverse-weighted mean rankings for Thyro-GenAI, ChatGPT, Perplexity, and Claude were 3.1, 2.3, 3.2, and 1.8, respectively. Conclusions: Thyro-GenAI produced patient-specific clinical reasoning output based on a vector database, with fewer hallucinations and more reliability, compared to service LLMs. This emphasis on evidence-based responses ensures its safety and validity, addressing a critical limitation of existing LLMs. By integrating RAG with LLMs, it has the potential to support frontline clinical decision-making, especially helping first-line physicians by offering reliable decision support while managing thyroid disease patients.
This study aimed to show the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament (ACL) injury case. A set of 100 questions and answers based on the 2022 AAOS ACL guidelines was curated. Closed-source (OpenAI's GPT-4/GPT-3.5 and Anthropic's Claude 3) and open-source models (Llama3 8b/70b and Mistral 8×7b) were asked questions in base form and again with AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with artificial intelligence (AI) agents and reevaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. Recall-Oriented Understudy for Gisting Evaluation and Metric for Evaluation of Translation with Explicit Ordering scores were calculated to assess semantic similarity in the responses. All noncustom LLMs started below 60% accuracy. Applying RAG improved the accuracy of every model by an average of 39.7%. The highest performing model with just RAG was Meta's open-source Llama3 70b (94%). The highest performing model with RAG and AI agents was OpenAI's GPT-4 (95%). RAG improved accuracy by an average of 39.7%, with the highest accuracy rate of 94% in Meta's Llama3 70b. Incorporating AI agents into a previously RAG-augmented LLM improved ChatGPT-4's accuracy rate to 95%. Thus, Agentic and RAG augmented LLMs can be accurate liaisons of information, supporting our hypothesis. Despite the literature surrounding the use of LLMs in medicine, there has been considerable and appropriate skepticism given the variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and agentic augmentation can better deliver accurate information in orthopaedic care.
With this knowledge, the online medical information commonly sought from popular LLMs, such as ChatGPT, can be standardized to better support shared decision making between surgeon and patient.
This study aimed to explore the capabilities of advanced large language models (LLMs), including OpenAI's GPT-4 variants, Google's Gemini series, and Anthropic's Claude series, in addressing highly specialized otolaryngology board examination questions. Additionally, the study included a longitudinal assessment of GPT-3.5 Turbo, which had been evaluated using the same set of questions one year earlier, to identify changes in its performance over time. We utilized a question bank comprising 2,576 multiple-choice and single-choice questions from a German online education platform tailored for otolaryngology board certification preparation. The questions were submitted to 11 different LLMs, including GPT-3.5 Turbo, GPT-4 variants, Gemini models, and Claude models, through Application Programming Interfaces (APIs) using Python scripts, facilitating efficient data collection and processing. GPT-4o demonstrated the highest accuracy among all models, particularly excelling in categories such as allergology and head and neck tumors. While the Claude models showed competitive performance, they generally lagged behind the GPT-4 variants. A comparison of GPT-3.5 Turbo's performance revealed a significant decline in accuracy over the past year. Newer LLMs displayed varied performance levels, with single-choice questions consistently yielding higher accuracy than multiple-choice questions across all models. While newer LLMs show strong potential in addressing specialized medical content, the observed decline in GPT-3.5 Turbo's performance over time underscores the necessity for continuous evaluation. This study highlights the critical need for ongoing optimization and efficient API usage to improve LLMs' potential for applications in medical education and certification.
Large Language Models (LLMs) hold promise for clinical decision support, but their real-world performance varies. We compared three leading models (OpenAI's "o1" Large Reasoning Model (LRM), Anthropic's Claude-3.5-Sonnet, and Meta's Llama-3.2-70B) to human experts in an emergency internal medicine setting. We conducted a prospective comparative study on 73 anonymized patient cases from the Emergency Internal Medicine ward of the University Hospital Split, Croatia (June-September 2024). Two independent internal medicine specialists, blinded to model identity, graded the LLM-generated reports in two steps: (1) they evaluated the relevance of recommended diagnostic tests based on the patient's signs, symptoms, and medical history; (2) after reviewing the actual diagnostic test results, they assessed each model's final diagnosis, therapy plan, and follow-up recommendations. The same evaluative framework was applied to human-authored reports. Likert scales (1-4 or 1-3) were used, and statistical comparisons included the Friedman and Wilcoxon signed-rank tests. The o1 model achieved a mean final rating (3.63) statistically indistinguishable from human physicians (3.67; p = 0.62). Claude-3.5-Sonnet (3.38) and Llama-3.2-70B (3.23) scored significantly lower (p < 0.01 vs. o1), largely due to errors in therapy planning and non-medication recommendations. Despite this gap, all three models demonstrated ≥90 % accuracy in final diagnoses and patient admission decisions. The o1 model correctly classified all abnormal lab values (100 %), while Claude-3.5-Sonnet and Llama-3.2-70B showed minor errors (99.5 % and 99 % accuracy, respectively). When evaluated on real-world emergency cases, an advanced LLM with enhanced reasoning (o1) can match expert-level clinical performance, underscoring its potential utility as a decision-support tool.
Large language models (LLMs) generate plausible text responses to medical questions, but inaccurate responses pose significant risks in medical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance LLM safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI's GPT-3.5/4/4o/o1-preview, Anthropic's Claude-3-Opus, Meta's LLaMA-2 (7B/13B/70B), and Mistral AI's Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the employed similarity metrics, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81-0.91). The reward model replicated human grading in 87.9% of cases across temperature settings, and rejection sampling significantly improved accuracy, by 8.36% overall. EVAL offers scalable potential to assess accuracy for high-stakes medical decision-making.
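The rejection-sampling step described above can be pictured as scoring several candidate answers with a reward model and keeping only the best one that clears a quality bar. The following is a minimal sketch under stated assumptions: the acceptance threshold, the function name, and the lookup-table stand-in for the reward model are all hypothetical, and the abstract does not specify how EVAL accepts or rejects candidates.

```python
def rejection_sample(candidates, reward_fn, threshold):
    """Return the highest-reward candidate, or None if none reach the threshold."""
    scored = [(reward_fn(candidate), candidate) for candidate in candidates]
    accepted = [pair for pair in scored if pair[0] >= threshold]
    if not accepted:
        return None  # every candidate was rejected
    return max(accepted, key=lambda pair: pair[0])[1]


# Stand-in for a trained reward model: a fixed table of made-up scores.
toy_rewards = {"draft_a": 0.42, "draft_b": 0.91, "draft_c": 0.77}


def toy_reward(candidate):
    # Unknown candidates get the lowest possible score.
    return toy_rewards.get(candidate, 0.0)


best = rejection_sample(list(toy_rewards), toy_reward, threshold=0.8)
none_pass = rejection_sample(list(toy_rewards), toy_reward, threshold=0.95)
print(best, none_pass)
```

In practice the candidates would be multiple sampled model responses to the same question, and the reward function would be the trained grader.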
Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs for Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context. A total of 599 multiple-choice questions drawn from Turkish Board examinations and an English neurosurgery question bank were presented to six LLMs (ChatGPT-o1pro, ChatGPT-4, AtlasGPT, Gemini, Copilot, and ChatGPT-3.5). Correctness rates were compared using the proportion z-test, and inter-model consistency was examined using Cohen's kappa. ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated relatively high accuracy for Single Best Answer-Recall of Knowledge (SBA-R), Single Best Answer-Interpretative Application of Knowledge (SBA-I), and True/False question types; however, performance notably decreased for questions with images, with some models leaving many unanswered items. Our findings suggest that GPT-4-based models and AtlasGPT can handle specialized neurosurgery questions at a near-expert level for SBA-R, SBA-I, and True/False formats. Nevertheless, all models exhibit notable limitations in questions with images, indicating that these tools remain supplementary rather than definitive solutions for neurosurgical training and decision-making.
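Cohen's kappa, used above for inter-model consistency, compares the observed agreement between two raters with the agreement expected by chance. The following is a minimal sketch assuming each model's output is coded per question as correct (1) or incorrect (0); that coding and the toy vectors are illustrative assumptions, since the abstract does not state how responses were coded.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of categorical ratings."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: share of items given the same rating by both.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical correct/incorrect codes for two models on eight questions.
model_1 = [1, 1, 1, 0, 0, 0, 1, 0]
model_2 = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(model_1, model_2)  # observed 0.75, expected 0.50
print(kappa)
```

A kappa of 0 indicates chance-level agreement and 1 indicates perfect agreement.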