Preprints, scientific manuscripts shared publicly prior to formal peer review, are gaining momentum across academic disciplines. However, their adoption in clinical and biomedical sciences remains limited, particularly in countries where traditional publishing norms prevail. Editorial ambiguity and a lack of national policy further complicate their use. This study aimed to assess the awareness, experiences, and attitudes of medical academics at Marmara University School of Medicine toward preprints and to explore the editorial landscape through both journal editor feedback and a review of journal-level preprint policies. A cross-sectional survey was conducted with 103 medical faculty members. The questionnaire included demographic questions, Likert scale items, and multiple-choice items assessing knowledge, familiarity, and attitudes toward preprints, as well as open-ended items to explore concerns. A "preprint test score" (0-4) was developed to quantify objective knowledge. Subgroup analyses were conducted by age (<40 vs ≥40 y) and academic discipline (basic vs clinical sciences). Additionally, all responses to open-ended questions from journal editors and 118 biomedical journals were manually reviewed for their stated stance on preprints and article processing charges (APCs). A convergent mixed methods design was used, combining a structured survey, thematic analysis of open-ended responses and editorial feedback, and a document-based review of biomedical journal policies. Only 42.9% (n=34) of participants reported familiarity with the concept of preprints, and 13% (n=10) had previously published on a preprint server. Misconceptions about ethics, peer review, and compatibility with journal policies were common. Subgroup analysis revealed that older participants scored higher on the "preprint test" (mean 2.20, SD 1.31 vs mean 1.97, SD 1.60) and had more experience with preprint publishing (1/40, 2.5% of younger participants; 7/29, 24.1% of older participants). 
Further, younger academics expressed less openness toward future use (n=7, 17.5% in the younger group; n=8, 27.6% in the older group). Clinical faculty were generally more hesitant than basic science faculty, although both groups raised concerns about the academic recognition of preprints. Editorial responses reflected a mix of cautious endorsement and skepticism. Among the 118 biomedical journals reviewed, most lacked clear preprint policies, while a small number either explicitly prohibited or permitted them. There is limited awareness and cautious engagement with preprints among medical academics and editors in Türkiye. Generational and discipline-based differences further influence knowledge and attitudes. The lack of clear editorial guidance from biomedical journals may reinforce academic uncertainty. Tailored educational initiatives, transparent journal policies, and institutional support will be essential to foster a more open and inclusive scientific publishing environment.
Generative artificial intelligence models, especially reasoning large language models (LLMs), are gaining adoption in health care for diagnostic decision support and medical education. DeepSeek R1 is a reasoning LLM that generates extended chain-of-thought explanations to make its decision-making process more explicit. Traditional medical benchmarks often lack complexity and authenticity, motivating the adoption of scenario-rich datasets, such as the Massive Multitask Language Understanding Pro (MMLU-Pro) professional medicine subset, which provides multispecialty clinical vignettes for reasoning-centric evaluation. The objective of this study is to assess the diagnostic accuracy, reasoning quality, reasoning transparency, and practical usability of DeepSeek R1 and Gemini 3 Pro across closed- and open-ended clinical scenarios, with the intention of guiding their prospective application in practical clinical education and training. This evaluation was conducted by analyzing 162 diverse medical scenarios (both closed- and open-ended) from the MMLU-Pro health subset. In a 2-phase, dual-model evaluation, DeepSeek R1 and Gemini 3 Pro were applied to 162 matched clinical vignettes from the MMLU-Pro professional medicine subset spanning 21 specialties. Closed-ended, multiple-choice, and open-ended prompts were constructed for the same scenarios, and model outputs were coded for accuracy, reasoning steps, and citation behavior; descriptive statistics and the McNemar test were used to compare performance across formats. DeepSeek R1 achieved an accuracy of 86.4% (140/162 scenarios) on closed-ended tasks and 80.9% (131/162) on open-ended questions across 162 clinical scenarios, indicating modest attenuation of performance when answer cues were removed. Gemini 3 Pro demonstrated 90.7% (147/162) closed-ended and 88.9% (144/162) open-ended accuracy on the same scenarios, showing a similar pattern of decreased performance without answer options. 
Error analysis indicated that incorrect answers typically involved longer reasoning chains, suggesting overthinking. In a structured review of open-ended responses, DeepSeek R1 produced an average of 18.7 (range 0-52) references per case, with 5.2 unrelated references and 13.1 (range 3-67) reasoning steps, whereas Gemini 3 Pro averaged 22.5 (range 12-50) references, 1.9 (range 0-8) unrelated references, and 4.4 (range 1-10) reasoning steps per case. DeepSeek R1 demonstrated moderate-to-excellent accuracy and reasoning in evaluating both closed- and open-ended medical scenarios. In parallel, Gemini 3 Pro showed broadly comparable but distinct performance and reasoning patterns. While the closed-ended format may inflate accuracy due to cueing, the open-ended evaluation yielded richer insights into the fidelity of reasoning. Side-by-side evaluation of two large reasoning models highlights the importance of format, specialty, and citation behavior when considering clinical and educational use. Continued validation across a wider range of specialties and real-world contexts will enhance the models' trustworthiness for diagnostic and teaching applications.
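The format comparison described above relies on the McNemar test, which uses only the scenarios where a model's closed- and open-ended answers disagree. The following is a minimal sketch of the exact (binomial) version of that test. The discordant counts are hypothetical, because the abstract reports only the marginal totals (140/162 vs 131/162), and the function name `mcnemar_exact` is illustrative rather than taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: scenarios answered correctly only in the open-ended format
    c: scenarios answered correctly only in the closed-ended format
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is equally likely
    # to fall on either side, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical split that matches a net loss of 9 correct answers (140 -> 131).
p_value = mcnemar_exact(b=4, c=13)
print(f"exact McNemar p = {p_value:.3f}")
```

Because only discordant pairs enter the calculation, two formats with the same marginal accuracy drop can yield different p-values depending on how the disagreements are distributed.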
Artificial intelligence (AI) has evolved through various trends, with different subfields gaining prominence over time. Currently, conversational AI, particularly generative AI, is at the forefront. Conversational AI models are primarily focused on text-based tasks and are commonly deployed as chatbots. Recent advancements by OpenAI have enabled the integration of external, independently developed models, allowing chatbots to perform specialized, task-oriented functions beyond general language processing. This study aims to develop a smart chatbot that integrates large language models from OpenAI with specialized domain-specific models, such as those used in medical image diagnostics. The system leverages transfer learning via Google's Teachable Machine to construct image-based classifiers and incorporates a diabetes detection model developed in TensorFlow.js. A key innovation is the chatbot's ability to extract relevant parameters from user input, trigger the appropriate diagnostic model, interpret the output, and deliver responses in natural language. The overarching goal is to demonstrate the potential of combining large language models with external models to build multimodal, task-oriented conversational agents. Two image-based models were developed and integrated into the chatbot system. The first analyzes chest X-rays to detect viral and bacterial pneumonia. The second uses optical coherence tomography images to identify ocular conditions such as drusen, choroidal neovascularization, and diabetic macular edema. Both models were incorporated into the chatbot to enable image-based medical query handling. In addition, a text-based model was constructed to process physiological measurements for diabetes prediction using TensorFlow.js. The architecture is modular; new diagnostic models can be added without redesigning the chatbot, enabling straightforward functional expansion. 
The findings demonstrate effective integration between the chatbot and the diagnostic models, with only minor deviations from expected behavior. Additionally, a stub function was implemented within the chatbot to schedule medical appointments based on the severity of a patient's condition, and it was specifically tested with the optical coherence tomography and X-ray models. This study demonstrates the feasibility of developing advanced AI systems, including image-based diagnostic models and chatbot integration, by leveraging AI as a service. It also underscores the potential of AI to enhance user experiences in bioinformatics, paving the way for more intuitive and accessible interfaces in the field. Looking ahead, the modular nature of the chatbot allows for the integration of additional diagnostic models as the system evolves.
SARS-CoV-2, the causative agent of COVID-19, remains a global health concern due to its high transmissibility and evolving variants. Although vaccination efforts and therapeutic advancements have mitigated disease severity, emerging mutations continue to challenge diagnostics and containment strategies. As of mid-February 2025, global test positivity has risen to 11%, marking the highest level in over 6 months, despite widespread immunization efforts. Newer variants demonstrate enhanced host cell binding, increasing both infectivity and diagnostic complexity. This study aimed to evaluate the effectiveness of deep transfer learning in delivering a rapid, accurate, and mutation-resilient COVID-19 diagnosis from medical imaging, with a focus on scalability and accessibility. An automated detection system was developed using state-of-the-art convolutional neural networks, including VGG16 (Visual Geometry Group network-16 layers), ResNet50 (residual network-50 layers), ConvNeXtTiny (convolutional next-tiny), MobileNet (mobile network), NASNetMobile (neural architecture search network-mobile version), and DenseNet121 (densely connected convolutional network-121 layers), to detect COVID-19 from chest X-ray and computed tomography (CT) images. Among all the models evaluated, DenseNet121 emerged as the best-performing architecture for COVID-19 diagnosis using X-ray and CT images. It achieved an impressive accuracy of 98%, with a precision of 96.9%, a recall of 98.9%, an F1-score of 97.9%, and an area under the curve score of 99.8%, indicating a high degree of consistency and reliability in detecting both positive and negative cases. The confusion matrix showed minimal false positives and false negatives, underscoring the model's robustness in real-world diagnostic scenarios. Given its performance, DenseNet121 is a strong candidate for deployment in clinical settings and serves as a benchmark for future improvements in artificial intelligence-assisted diagnostic tools. 
The study results underscore the potential of artificial intelligence-powered diagnostics in supporting early detection and global pandemic response. With careful optimization, deep learning models can address critical gaps in testing, particularly in settings constrained by limited resources or emerging variants.
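The performance figures reported for DenseNet121 are linked by standard definitions, which can be checked directly. The sketch below recomputes the F1-score from the reported precision (96.9%) and recall (98.9%); the confusion-matrix counts in the second example are hypothetical, since the abstract does not give them, and the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive the usual summary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Consistency check using the precision and recall stated in the abstract.
reported_f1 = f1_score(0.969, 0.989)
print(f"F1 from reported precision and recall: {reported_f1:.3f}")

# Hypothetical counts, for illustration only (not from the study).
example = classification_metrics(tp=90, fp=5, fn=10, tn=95)
print(example)
```

The recomputed value agrees with the reported F1-score of 97.9% to one decimal place.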
Consumer-level drug recalls usually require action by individual patients. The Food and Drug Administration (FDA) has public-facing outlets to inform the public about drug safety information, including all recalls, but individual consumers may not be aware of them. And there is no system in place to notify individual prescribers which of their patients are affected by a specific recall. We aimed to leverage the FDA's Healthy Citizen prototype web-based software platform, which provides users with information about recalls, to automatically notify patients of relevant recalls. We developed and evaluated an electronic notification system in the primary care and cardiology practices at a large, urban, academic medical center. The health care portal scanned the FDA Healthy Citizen application programming interface nightly to detect new recalls, identified patients who had those medications in their electronic health record (EHR) medication list, and sent them a message through the EHR patient portal with a link to a customized FDA information display. Using structured interviews, we assessed qualitative feedback on the system and portal messaging from a convenience sample of 9 patients. The system was technically functional, but it was not possible to trace a medication prescription from the EHR to specific lot numbers dispensed to that patient by a community pharmacy. The qualitative feedback obtained from patients showed convergence of topics. Lack of an accurate electronic audit trail from prescription to dispensed medication precludes clinical deployment of automated drug recall notification.
Genetic testing can determine familial and personal risks for heritable thoracic aortic aneurysms and dissections (TAD). The 2022 American College of Cardiology/American Heart Association guidelines for TAD recommend management decisions based on the specific gene mutation. However, many clinicians lack sufficient comfort or insight to integrate genetic information into clinical practice. We therefore developed the Genomic Medicine Guidance (GMG) application, an interactive point-of-care tool to inform clinicians and patients about TAD diagnosis, treatment, and surveillance. GMG is a REDCap-based application that combines publicly available genetic data and clinical recommendations based on the TAD guidelines into one translational education tool. TAD genetic information in GMG was sourced from the Montalcino Aortic Consortium, a worldwide collaboration of TAD centers of excellence, and the National Institutes of Health genetic repositories ClinVar and ClinGen. The application streamlines data on the 13 most frequently mutated TAD genes with 2286 unique pathogenic mutations that cause TAD so that users receive comprehensive recommendations for diagnostic testing, imaging, surveillance, medical therapy, and preventative surgical repair, as well as guidance for exercise safety and management during pregnancy. The application output can be displayed in a clinician view or exported as an informative pamphlet in a patient-friendly format. The overall goal of the GMG application is to make genomic medicine more accessible to clinicians and patients while serving as a unifying platform for research. We anticipate that these features will be catalysts for collaborative projects aiming to understand the spectrum of genetic variants contributing to TAD.
The existence of the variable component of the systematic error (VCSE) was known from the beginning. Still, it is a kind of taboo: it does not have a definition in the International Vocabulary of Metrology and is not present in equations, as it is considered transformed over time into random error. This theoretical study aims to reevaluate the role and significance of the VCSE in quality control (QC). Assuming three quintessential principles, namely (1) a parameter must be determined under the same conditions under which it is used, (2) a calibration cannot correct smaller biases than the calibration error, and (3) a constant cannot correct a variable, it was deduced that the source of the VCSE is bias drift caused by reagent instability and the shifts caused by human interventions. Both phenomena are mentioned in the literature. The two causes were confirmed by two series of computer simulations using 1000 normally distributed values with an SD of 1 to simulate random error and appropriately chosen bias values to simulate (1) drifts with different slopes and (2) variable shifts. Real-life examples from day-to-day QC, using Roche reagents on Cobas 6000 and Cobas PRO analyzers, confirmed the computer simulations. "The bias" is a definitional uncertainty because bias is time-variable. The causes of the cyclic variations are reagent instability and human intervention, confirmed by computer simulation and real-life QC data. Making a clear distinction between bias measured under repeatability and reproducibility within laboratory conditions, as in the case of SDs, and also separating constant and variable subcomponents of the systematic error, 2 sets of error parameters are obtained, each set being consistent with the measurement conditions. The link between them is the time-variable VCSE function. Further properties of VCSE(t) distinguish it from the random error component: short-term predictability and corrigibility, and a non-Gaussian distribution. 
Its transformation into random phenomena is a myth based on confusion between random and variable error components. The accurate determination of the VCSE(t) function is possible, but it has an excessively high cost-effectiveness ratio. Because it is hidden in the bias measured in repeatability and in the SD in reproducibility within laboratory conditions, it helps us to avoid the redundant use in total measurement error and MU equations. Several false assumptions behind the Westgard rules were uncovered. The new error model aims to serve as the foundation of a new QC system. Internal QC decisions are only consistent with graphs designed using SD measured in repeatability conditions; therefore, they are not consistent with the actual Westgard rules. Alarms should be avoided in cases of incorrigible biases. Immediately after calibration, constant biases, gradually increasing biases, and unexpected shifts in bias represent distinct situations, each requiring a unique strategy.
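The simulation idea described in this abstract (1000 normally distributed values with an SD of 1, plus a time-varying bias) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the sawtooth drift shape, the cycle length of 100 measurements, the maximum drift of 2 SD, and the random seed are all hypothetical choices.

```python
import random
import statistics


def simulate_qc(n=1000, cycle=100, max_drift=2.0, seed=1):
    """Simulate QC results as random error plus a drifting bias.

    The bias rises linearly within each cycle and resets to zero at the
    start of the next one, mimicking recalibration (assumed pattern).
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random error, SD 1
    bias = [max_drift * (i % cycle) / cycle for i in range(n)]
    observed = [e + b for e, b in zip(noise, bias)]
    return noise, bias, observed


noise, bias, observed = simulate_qc()
sd_random = statistics.stdev(noise)       # repeatability-like SD
sd_observed = statistics.stdev(observed)  # long-run SD that absorbs the drift
print(f"SD of random error alone: {sd_random:.2f}")
print(f"SD of observed series:    {sd_observed:.2f}")
```

In this sketch the long-run SD is inflated by the drift even though the drift itself is deterministic and predictable within a cycle, which mirrors the distinction the abstract draws between variable and random error.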
The IT sector is growing and encompasses all professions, from leisure and recreation to hospitals and emergency response groups. IT professionals are experiencing increased threats (eg, ransomware attacks), but little is known about the relationship between these IT profession-specific stressors and the professionals' mental health. This study aimed to (1) estimate the associations between IT profession-specific stressors and anxiety, depression, and stress, and (2) examine the role of mental health literacy (MHL) as a mediator of the relationship between depression, anxiety, stress, and help-seeking. Between February and May 2023, IT professionals working in the United States were surveyed online. Participants (n=357) reported demographic characteristics, MHL, mental health symptoms, and help-seeking intentions with the following scales: Mental Health Literacy in the Workplace (MHL-W), Center for Epidemiological Studies Depression-10 (CESD-10), Generalized Anxiety Disorder-7 (GAD-7), Perceived Stress Scale-10 (PSS-10), and the Mental Help Seeking Intention Scale (MHSIS). Descriptive statistics, regression models, and mediation analyses were conducted for CESD-10, GAD-7, and PSS-10. Respondents who had experienced ransomware attacks in the past year reported significantly higher symptoms of depression (odds ratio [OR] 1.85, 95% CI 1.07-3.22; P=.03). Past-year exposure to balancing security and usability was associated with lower odds of reported anxiety (OR 0.48, 95% CI 0.28-0.82; P=.008). Having made critical technology decisions with limited information in the past year was associated with higher perceived stress by 2.02 points on the PSS-10 scale (SE 0.84, 95% CI 0.37-3.66; P=.02), and working with limited resources in the past year increased perceived stress by 1.70 points (SE 0.84, 95% CI 0.04-3.35; P=.04) after adjusting for the covariates. 
MHL was found to partially mediate the relationship between depression and help-seeking, but not between anxiety or stress and help-seeking. These findings provide insight into the workplace stressors that pose a greater psychological health risk for IT professionals. These results emphasize the important role of MHL in helping facilitate the connection between depressive symptoms and help-seeking.
Studies have shown that large language models (LLMs) are promising in therapeutic decision-making, with findings comparable to those of medical experts, but these studies used highly curated patient data. This study aimed to determine if LLMs can make guideline-concordant treatment decisions based on patient data as typically present in clinical practice (lengthy, unstructured medical text). We conducted a retrospective study of 80 patients with severe aortic stenosis who were scheduled for either surgical (SAVR; n=24) or transcatheter aortic valve replacement (TAVR; n=56) by our institutional heart team in 2022. Various LLMs (BioGPT, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, LLaMA-2, Mistral, PaLM 2, and DeepSeek-R1) were queried using either anonymized original medical reports or manually generated case summaries to determine the most guideline-concordant treatment. We measured agreement with the heart team using Cohen κ coefficients, reliability using intraclass correlation coefficients (ICCs), and fairness using the frequency bias index (FBI; FBI >1 indicated bias toward TAVR). When presented with original medical reports, LLMs showed poor performance (Cohen κ coefficient: -0.47 to 0.22; ICC: 0.0-1.0; FBI: 0.95-1.51). The LLMs' performance improved substantially when case summaries were used as input and additional guideline knowledge was added to the prompt (Cohen κ coefficient: -0.02 to 0.63; ICC: 0.01-1.0; FBI: 0.46-1.23). Qualitative analysis revealed instances of hallucinations in all LLMs tested. Even advanced LLMs require extensively curated input for informed treatment decisions. Unreliable responses, bias, and hallucinations pose significant health risks and highlight the need for caution in applying LLMs to real-world clinical decision-making.
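To make the agreement and fairness measures concrete, the sketch below computes Cohen κ and a frequency bias index for the cohort split reported above (56 TAVR, 24 SAVR). The FBI formula used here, predicted TAVR count divided by heart-team TAVR count, is an assumption chosen to match the statement that FBI >1 indicates bias toward TAVR; the "always TAVR" model is hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen kappa for two equally long lists of categorical labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def frequency_bias_index(reference, predicted, positive="TAVR"):
    """Assumed definition: predicted positives / reference positives."""
    return predicted.count(positive) / reference.count(positive)


heart_team = ["TAVR"] * 56 + ["SAVR"] * 24  # cohort split from the abstract
always_tavr = ["TAVR"] * 80                 # hypothetical model output

kappa = cohen_kappa(heart_team, always_tavr)
fbi = frequency_bias_index(heart_team, always_tavr)
print(f"kappa = {kappa:.2f}, FBI = {fbi:.2f}")
```

A model that always recommends TAVR agrees with the heart team in 70% of cases yet has a κ of zero and an FBI above 1, which is why raw agreement alone would be misleading in this cohort.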
The increasing integration of artificial intelligence (AI) systems into critical societal sectors has created an urgent demand for robust privacy-preserving methods. Traditional approaches such as differential privacy and homomorphic encryption often struggle to maintain an effective balance between protecting sensitive information and preserving data utility for AI applications. This challenge has become particularly acute as organizations must comply with evolving AI governance frameworks while maintaining the effectiveness of their AI systems. This paper aims to introduce and validate data obfuscation through latent space projection (LSP), a novel privacy-preserving technique designed to enhance AI governance and ensure responsible AI compliance. The primary goal is to develop a method that can effectively protect sensitive data while maintaining essential features necessary for AI model training and inference, thereby addressing the limitations of existing privacy-preserving approaches. We developed LSP using a combination of advanced machine learning techniques, specifically leveraging autoencoder architectures and adversarial training. The method projects sensitive data into a lower-dimensional latent space, where it separates sensitive from nonsensitive information. This separation enables precise control over privacy-utility trade-offs. We validated LSP through comprehensive experiments on benchmark datasets and implemented 2 real-world case studies: a health care application focusing on cancer diagnosis and a financial services application analyzing fraud detection. LSP demonstrated superior performance across multiple evaluation metrics. In image classification tasks, the method achieved 98.7% accuracy while maintaining strong privacy protection, providing 97.3% effectiveness against sensitive attribute inference attacks. This performance significantly exceeded that of traditional anonymization and privacy-preserving methods. 
The real-world case studies further validated LSP's effectiveness, showing robust performance in both health care and financial applications. Additionally, LSP demonstrated strong alignment with global AI governance frameworks, including the General Data Protection Regulation, the California Consumer Privacy Act, and the Health Insurance Portability and Accountability Act. LSP represents a significant advancement in privacy-preserving AI, offering a promising approach to developing AI systems that respect individual privacy while delivering valuable insights. By embedding privacy protection directly within the machine learning pipeline, LSP contributes to key principles of fairness, transparency, and accountability. Future research directions include developing theoretical privacy guarantees, exploring integration with federated learning systems, and enhancing latent space interpretability. These developments position LSP as a crucial tool for advancing ethical AI practices and ensuring responsible technology deployment in privacy-sensitive domains.
Long COVID (post-COVID-19 condition) continues to challenge primary care. To support family physicians in British Columbia, the general internal medicine (GIM) COVID-19 Rapid Access to Consultative Expertise (RACE) line was launched in August 2020 to provide real-time specialist advice. This quality improvement study aimed to evaluate the implementation and utilization of the GIM-COVID-19 Long-Term Sequelae RACE line in British Columbia. Specifically, it sought to characterize the demographics of patients involved in RACE consultations, identify the most common themes and clinical queries presented by primary care providers, and assess how usage patterns evolved over time during the COVID-19 pandemic. We conducted a retrospective descriptive analysis of 149 RACE line call summaries between August 2020 and June 2021. Six calls were excluded due to insufficient information, such as incomplete documentation or absence of a clear COVID-19-related question, leaving 143 eligible calls. Because the original extraction notes are no longer available, further details about the excluded calls cannot be provided. Data extracted included patient age, sex, geographical location, symptom type, and timing of symptom onset post-COVID-19 infection. Calls were categorized by symptom duration (acute: <2 wk, subacute: 2-12 wk, chronic: >12 wk), thematic content (respiratory, fatigue, neurological, etc), and query type (symptom management, return-to-work, vaccination, etc). Data were coded independently by two reviewers using a standardized spreadsheet and predefined codebook. Discrepancies were resolved through discussion. Descriptive statistics summarized the findings. Many calls involved female patients (91/143, 64%), with the most common age group being 40-49 years (32/113, 28%). Most calls came from Greater Vancouver (35/83, 42%) and the Fraser Valley (29/83, 35%). Subacute symptoms (52/149, 35%) and vaccination-related concerns (29/149, 19%) were the most common inquiry types. 
Symptom-related inquiries accounted for 92 of 143 calls (64%), with 253 symptoms documented overall. Respiratory symptoms were most common (100/253, 40%), especially shortness of breath (35 calls), cough (26), and fatigue (23). Call volumes peaked from January to June 2021, coinciding with the provincial vaccine rollout. The GIM-COVID-19 Long-Term Sequelae RACE line served as a critical early support system for primary care providers as the long COVID landscape evolved. This quality improvement study emphasizes the value of rapid access and specialist-informed consultation tools during emerging public health challenges. The trends ascertained may inform future health system responses, particularly when designing more scalable, interdisciplinary models to support primary care in managing complex chronic conditions.
The COVID-19 pandemic presented many unknowns for pregnant women, with anemia potentially worsening pregnancy outcomes due to multiple factors. This review aimed to determine the pooled effect of maternal anemia interventions and associated factors during the pandemic. Eligible studies were observational and included reproductive-age women receiving anemia-related interventions during the COVID-19 pandemic. Exclusion criteria comprised non-English publications, reviews, editorials, case reports, studies with insufficient data, sample sizes below 50, and those lacking DOIs. A systematic search of PubMed, Scopus, Embase, Web of Science, and Google Scholar identified articles published between December 2019 and August 2022. Risk of bias was evaluated using the Cochrane Risk of Bias 2 tool for randomized trials and the National Institutes of Health's assessment tool for observational studies. Pooled rate ratios (RRs) with 95% CIs were calculated in Review Manager 5.4.1. Synthesis included subgroup analysis, meta-regression, and publication bias checks to assess intervention effectiveness. This meta-analysis included 11 studies with 6129 pregnant women. Of these, 3591 (59%) were in the intervention group and 2538 (41%) were in the comparator group. Effects were recorded for 1921 (53.4%) women in the intervention group and 1350 (53.1%) in the comparator group. The cumulative impact ranged from 23% to 81%, averaging 56%. The initial analysis showed no significant effect on anemia prevention (RR 0.79, 95% CI 0.61-1.02; P=.07), with high heterogeneity (I²=97%). Sensitivity analysis excluding 4 outlier studies improved the effect size to a significant level at 39% (RR 0.61, 95% CI 0.43-0.87; P=.006). Subgroup analysis revealed substantial heterogeneity (I²=87.2%). Intravenous sucrose had a poor impact (RR 1.31, 95% CI 1.17-1.47; P<.001), while medicinal or herbal interventions showed benefit (RR 0.81, 95% CI 0.73-0.90; P=.006). 
Educational interventions yielded a 28% effect (RR 0.72), medicinal administration 19% (RR 0.81), iron supplementation 17% (RR 0.83), and intravenous ferric carboxymaltose 15% (RR 0.85; P<.02). Additional sensitivity analysis confirmed a pooled positive effect of 17% (RR 0.83, 95% CI 0.79-0.88; P<.001), with minimal heterogeneity (I²=0%). Regionally, effectiveness was highest in Africa (RR 0.84, 95% CI 0.79-0.89; P<.001). Multicenter studies and those with 2020 data were predictive of better outcomes (RR 0.84 and RR 0.50, respectively). Despite initial heterogeneity and publication bias, interventions showed utility in mitigating maternal anemia in targeted subgroups and regions. Maternal anemia interventions during the COVID-19 pandemic showed modest, context-specific effectiveness, with declining impact from 2020 to 2022. Although high heterogeneity and study inconsistencies limited generalizability, significant benefits were observed particularly in African and multicenter studies. The pandemic exposed gaps in maternal health systems, emphasizing the need for tailored interventions, stronger data infrastructure, and resilient care strategies in future global crises.
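The relationship between a pooled RR, its 95% CI, and the P value can be illustrated with a short calculation. The sketch below assumes a normal (Wald) approximation on the log scale, which may differ in detail from the computation performed by Review Manager; the function name is illustrative. It is applied to the initial analysis (RR 0.79, 95% CI 0.61-1.02) and the first sensitivity analysis (RR 0.61, 95% CI 0.43-0.87) reported in this abstract.

```python
from math import erfc, log, sqrt


def p_from_ratio_ci(rr: float, lower: float, upper: float,
                    z_crit: float = 1.96) -> float:
    """Approximate two-sided P value for a ratio from its 95% CI.

    The standard error is recovered from the CI width on the log scale,
    and the P value comes from the standard normal distribution.
    """
    se = (log(upper) - log(lower)) / (2 * z_crit)
    z = log(rr) / se
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


p_initial = p_from_ratio_ci(0.79, 0.61, 1.02)
p_sensitivity = p_from_ratio_ci(0.61, 0.43, 0.87)
print(f"initial analysis:     P = {p_initial:.3f}")
print(f"sensitivity analysis: P = {p_sensitivity:.3f}")
```

Under this approximation both values round to the P values reported for those two analyses (.07 and .006).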
Due to its diagnostic accuracy, point-of-care ultrasound (POCUS) is becoming more frequently used in the emergency department (ED), but the feasibility of its use by in-training residents and the potential clinical impact have not been assessed. This study aimed to assess the feasibility of implementing a structured POCUS training program for in-training ED residents, as well as the clinical impact of their use of POCUS in the management of patients in the ED. IMPULSE (Impact of a Point-of-Care Ultrasound Examination) is a before-and-after implementation study evaluating the impact of a structured POCUS training program for ED residents on the management of patients admitted with acute respiratory failure (ARF) and/or circulatory failure (ACF) in a Swiss regional hospital. The training curriculum was organized into 3 steps and consisted of a web-based training course; an 8-hour, practical, hands-on session; and 10 supervised POCUS examinations. ED residents who successfully completed the curriculum participated in the postimplementation phase of the study. Outcomes were time to ED diagnosis, rate and time to correct diagnosis in the ED, time to prescribe appropriate treatment, and in-hospital mortality. Standard statistical analyses were performed using chi-square and Mann-Whitney U tests as appropriate, supplemented by Bayesian analysis, with a Bayes factor (BF)>3 considered significant. A total of 69 and 54 patients were included before and after implementation of the training program, respectively. The median time to ED diagnosis was 25 (IQR 15-60) minutes after implementation versus 30 (IQR 10-66) minutes before implementation, a difference that was significant in the Bayesian analysis (BF=9.6). The rate of correct diagnosis was higher after implementation (51/54, 94% vs 36/69, 52%; P<.001), with a significantly shorter time to correct diagnosis after implementation (25, IQR 15-60 min vs 43, IQR 11-70 min; BF=5.0). 
The median time to prescribe the appropriate therapy was shorter after implementation (47, IQR 25-101 min vs 70, IQR 20-120 min; BF=2.0). Finally, there was a significant difference in hospital mortality (9/69, 13% vs 3/54, 6%; BF=15.7). The IMPULSE study shows that the implementation of a short, structured POCUS training program for ED residents is not only feasible but also has a significant impact on their initial evaluation of patients with ARF and/or ACF, improving diagnostic accuracy, time to correct diagnosis, and rate of prescribing the appropriate therapy and possibly decreasing hospital mortality. These results should be replicated in other settings to provide further evidence that implementation of a short, structured POCUS training curriculum could significantly impact ED management of patients with ARF and/or ACF.
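The comparison of correct-diagnosis rates in the IMPULSE abstract (51/54 after vs 36/69 before) can be checked with a chi-square test on a 2x2 table. The sketch below is one way to do that using only the Python standard library. It assumes a Pearson chi-square test without continuity correction; the abstract does not say which variant was used, and the function name is illustrative.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction, an assumption)
    for the 2x2 table [[a, b], [c, d]]. Returns (statistic, p_value)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: after vs before implementation; columns: correct vs incorrect diagnosis.
chi2, p_value = chi_square_2x2(51, 3, 36, 33)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2e}")
```

With these counts the p-value falls well below .001, which agrees with the reported P<.001.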
Alzheimer disease (AD) is a severe neurological brain disorder. While not curable, earlier detection can help improve symptoms substantially. Machine learning (ML) models are popular and well suited for medical image processing tasks such as computer-aided diagnosis. These techniques can improve the process for an accurate diagnosis of AD. This paper presents a complete computer-aided diagnosis system for AD. We investigate the performance of some of the most used ML techniques for AD detection and classification using neuroimages from the Open Access Series of Imaging Studies (OASIS) and Alzheimer's Disease Neuroimaging Initiative (ADNI) datasets. The system uses artificial neural networks (ANNs) and support vector machines (SVMs) as classifiers, and dimensionality reduction techniques as feature extractors. To retrieve features from the neuroimages, we used principal component analysis (PCA), linear discriminant analysis, and t-distributed stochastic neighbor embedding. These features are fed into feedforward neural networks (FFNNs) and SVM-based ML classifiers. Furthermore, we applied the vision transformer (ViT)-based ANNs in conjunction with data augmentation to distinguish patients with AD from healthy controls. Experiments were performed on magnetic resonance imaging and positron emission tomography scans. The OASIS dataset included a total of 300 patients, while the ADNI dataset included 231 patients. For OASIS, 90 (30%) patients were healthy and 210 (70%) were severely impaired by AD. Likewise, for the ADNI database, a total of 149 (64.5%) patients with AD were detected and 82 (35.5%) patients were used as healthy controls. A statistically significant difference was found between healthy controls and patients with AD (P=.02). 
We examined the effectiveness of the three feature extractors and classifiers using 5-fold cross-validation and confusion matrix-based standard classification metrics, namely, accuracy, sensitivity, specificity, precision, F1-score, and area under the receiver operating characteristic curve (AUROC). Compared with the state-of-the-art performing methods, the success rate was satisfactory for all the created ML models, but SVM and FFNN performed best with the PCA extractor, while the ViT classifier performed best with more data. The data augmentation/ViT approach worked better overall, achieving accuracies of 93.2% (sensitivity=87.2, specificity=90.5, precision=87.6, F1-score=88.7, and AUROC=92) for OASIS and 90.4% (sensitivity=85.4, specificity=88.6, precision=86.9, F1-score=88, and AUROC=90) for ADNI. Effective ML models using neuroimaging data could help physicians working on AD diagnosis and will assist them in prescribing timely treatment to patients with AD. Good results were obtained on the OASIS and ADNI datasets with all the proposed classifiers, namely, SVM, FFNN, and ViTs. However, the results show that the ViT model is much better at predicting AD than the other models when a sufficient amount of data are available to perform the training. This highlights that the data augmentation process could impact the overall performance of the ViT model.
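The confusion matrix-based metrics named in the AD abstract follow standard definitions. The sketch below shows how they can be derived from the four cells of a binary confusion matrix. The counts are hypothetical and are not taken from the OASIS or ADNI experiments. AUROC is omitted because it requires per-sample scores rather than counts.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion matrix counts.
    Assumes every denominator is nonzero."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for one cross-validation fold (illustration only).
example = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In a 5-fold cross-validation setting, such metrics would typically be computed per fold and then averaged, although the abstract does not specify the aggregation used.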
Breast cancer is the leading cause of morbidity and mortality worldwide. Accurate sentinel lymph node (SLN) mapping is crucial for staging and treatment planning in early-stage breast cancer. Indocyanine green (ICG) has emerged as a promising agent for fluorescence imaging in SLN mapping. However, comprehensive assessment of its clinical utility, including accuracy and adverse effects, remains limited. This scoping review aims to consolidate evidence on the use of ICG in breast cancer SLN mapping. The objective of this scoping review is to evaluate the current literature on the use of ICG in SLN mapping for patients with breast cancer. This review aims to assess the accuracy, efficacy, and safety of ICG in this context and to identify gaps in the existing research. The outcomes will contribute to the development of further research as part of a PhD project. Five electronic databases will be searched (PubMed, Embase, MEDLINE, Web of Science, and Scopus) using search strategies developed in consultation with an academic supervisor. The search strategy is set to human studies published in English within the last 11 years. All retrieved citations will be imported to Zotero and then uploaded to Covidence for the screening of titles, abstracts, and full text according to prespecified inclusion criteria. Patients with early-stage breast cancer (T1 and T2), selected T3 cases where the SLN biopsy is accurate, and those with clinically node-negative breast cancer will be included. The intervention criterion includes studies using ICG for SLN mapping and studies on the assessment of fluorescence imaging cameras. Citations meeting the inclusion criteria for full-text review will have their data extracted by 2 independent reviewers, with disagreements resolved by discussion. A data extraction tool will be developed to capture full details about the participants, concept, and context, and findings relevant to the scoping review will be summarized. 
The preliminary search began in December 2023. As of September 2024, papers have been screened and data are currently being extracted. Out of the 2130 references initially imported, 126 studies met the inclusion criteria after screening. The scoping review is expected to be published in January 2025. Although ICG technology has been used for SLN mapping in patients with breast cancer, initial searches in 2022 revealed limited data on this technique's feasibility, safety, and effectiveness. At that time, preliminary search of Scopus, MEDLINE, Embase, and PubMed identified no current or forthcoming systematic reviews or scoping reviews on the topic. However, recent searches indicate a substantial increase in research and reviews, reflecting a growing interest and evidence in this area.
Urinary incontinence (UI) is a prevalent condition affecting millions worldwide, particularly women, with significant impacts on physical, psychological, and socioeconomic aspects of life (Haylen et al., Neurourol Urodyn 29:4–20, 2010; Aoki et al., Nat Rev Dis Primers 3:1–20, 2017). Conventional management includes behavioral therapy, pelvic floor muscle training (PFMT), and pharmacological interventions, but barriers such as social stigma, access to specialists, and poor treatment adherence persist (Nitti Rev Urol 3, 2001; Sinclair et al., Obstet Gynaecol 13:143-8, 2011; Minassian et al., 111:324-31, 2008; Milsom et al., Eur Urol 65:79-95, 2014). Telerehabilitation—defined as the delivery of rehabilitation services via electronic information and communication technologies (e.g., video conferencing and phone calls for improved access; mobile apps, websites, and virtual reality (VR) for enhanced engagement and self-management)—offers a potentially promising alternative to overcome these obstacles (Buckingham et al., JMIRx Med 3:e30516, 2022). This narrative review synthesizes evidence from studies conducted between January 2000 and November 6, 2025 on telerehabilitation’s role in UI management in women, focusing on stress UI, PFMT efficacy, and comparative outcomes with in-person therapy. It addresses gaps in prior systematic reviews by focusing on patient-centered designs and cultural adaptations. Key findings from 25 included studies indicate that telerehabilitation is feasible, effective in reducing UI symptoms, improving quality of life (QoL), and enhancing adherence, particularly through mobile apps and group-based interventions (Asklund et al., Neurourol Urodyn 36:1369-76, 2017; Sjostrom et al., BJU Int 112:362-72, 2013; Hoffman et al., Gynecol Scand 96:1180-7, 2017). 
However, limitations include heterogeneity in interventions, small sample sizes in many studies, lack of long-term data, absence of male participants, limited validation in rural or cognitively impaired populations, and insufficient cultural adaptations for diverse groups. Recommendations include developing tailored telerehabilitation programs incorporating biofeedback and interdisciplinary approaches to address UI holistically. This review highlights telerehabilitation’s potential as a scalable, cost-effective intervention, particularly post-COVID-19, and calls for further research in diverse female populations.
Cardiac implantable electronic devices (CIEDs) are crucial in managing various cardiac conditions, but their monitoring poses considerable challenges. Algorithm-enabled remote monitoring of these devices has emerged as a promising solution to enhance patient outcomes and potentially reduce health care expenditures; however, its economic impact remains underexplored. This systematic review protocol aims to review and synthesize the existing evidence on the cost-effectiveness and cost-utility of algorithm-enabled remote monitoring for CIEDs in patients with or at risk of heart failure. The search of literature will be performed in MEDLINE, Embase, Scopus, Web of Science, and the Cochrane Library, with supplementary searches in the National Health Service Economic Evaluation Database, the National Institute for Health and Care Excellence, the Canadian Agency for Drugs and Technologies in Health, the International Network of Agencies for Health Technology Assessment, and the National Institute for Health and Care Research. This protocol is reported in accordance with the PRISMA-P (Preferred Reporting Items for Systematic Reviews and Meta-Analysis Protocols) 2015 statement, and the completed review will be reported following the PRISMA 2020 statement. Database searching retrieved 3108 records; after 731 (23.5%) duplicates were removed, 2377 (76.5%) records remained for title and abstract screening. The review will identify and synthesize economic evaluations of algorithm-enabled remote monitoring in adults with CIEDs, summarizing reported costs, outcomes, and cost-effectiveness results. Methodological quality, risk of bias, and sources of heterogeneity across studies will be assessed. The findings of this review may help inform health care providers, policymakers, and other stakeholders by clarifying the current economic evidence on these monitoring systems, informing adoption decisions, and identifying areas requiring further research.
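The record counts reported in the CIED protocol can be cross-checked with simple arithmetic. The sketch below only reproduces the figures stated in the abstract; the variable names are illustrative.

```python
# Counts as stated in the abstract.
records_retrieved = 3108
duplicates_removed = 731

records_to_screen = records_retrieved - duplicates_removed
duplicate_share = 100.0 * duplicates_removed / records_retrieved
screening_share = 100.0 * records_to_screen / records_retrieved

print(f"records for screening: {records_to_screen}")
print(f"duplicates: {duplicate_share:.1f}%, retained: {screening_share:.1f}%")
```

The result matches the reported 2377 records, with 23.5% removed and 76.5% retained.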
The International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use published the estimand framework in 2019. The estimand framework aims to clearly define a treatment effect for a clinical question through construction of estimands, and it has been widely applied in clinical trials in the pharmaceutical industry. The estimand framework proposes 5 attributes for an estimand: treatments, variables, target populations, population-level summaries, and intercurrent events. It also proposes the treatment policy strategy, the hypothetical strategy, the composite variable strategy, the while on treatment strategy, and the principal stratum strategy to handle intercurrent events. When people give clear definitions for these 5 attributes, they clearly define an estimand that represents a treatment effect. From a statistical perspective, a genuine or causal treatment effect is defined through a causal inference framework. This article aims to interpret the estimand framework using a causal inference framework and help researchers understand the differences between estimands and causal treatment effects. From a causal inference framework based on potential outcomes, an individual treatment effect (ITE) is defined by comparison of individual potential outcomes with experimental or control treatments, and the average treatment effect (ATE) of the experimental treatment versus the control treatment is defined as an average of all ITEs. The statistical presentation of the ATE is not equivalent to an estimand. It has the same treatments, variables, target populations, and population-level summaries as an estimand, but intercurrent events are not part of it. Intercurrent events modify the statistical presentation of the ATE through treatments, variables, and target populations, whose impact can be controlled by intercurrent event strategies. 
I propose that the estimand attributes can be mapped onto the statistical presentation of the ATE, and that intercurrent events act as mediation mechanisms in the attribute mapping process, which provides a novel way to incorporate the causal inference framework into the estimand framework. If the estimand framework is combined with a causal inference framework, it will gain a stronger theoretical foundation. The interpretation of the estimand framework from a causal inference perspective is useful for both industrial and academic clinical trials. Observational studies may also find useful information on causal inference theories in this article.
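The potential-outcomes definitions described in the estimand abstract above can be written compactly. The notation below is an assumption made for illustration and is not taken from the article: Y_i(1) and Y_i(0) denote the potential outcomes of individual i under the experimental and control treatments, and N is the size of the target population.

```latex
\begin{align}
\mathrm{ITE}_i &= Y_i(1) - Y_i(0), \\
\mathrm{ATE} &= \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr)
              = \frac{1}{N} \sum_{i=1}^{N} \mathrm{ITE}_i .
\end{align}
```

Only one of the two potential outcomes is observed for any individual, which is why the ATE must be estimated rather than computed directly. Intercurrent events do not appear in this expression, consistent with the article's point that they are not part of the statistical presentation of the ATE.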
On average, 1 in 10 patients die because of a diagnostic error, and medical errors represent the third largest cause of death in the United States. While large language models (LLMs) have been proposed to aid doctors in diagnoses, no research results have been published comparing the diagnostic abilities of many popular LLMs on a large, openly accessible real-patient cohort. In this study, we set out to compare the diagnostic ability of 18 LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic, using 3 prompts, 2 temperature settings, and 1000 randomly selected Medical Information Mart for Intensive Care-IV (MIMIC-IV) hospital admissions. We also explore improving the diagnostic hit rate of GPT-4o 05-13 with retrieval-augmented generation (RAG) by utilizing reference ranges provided by the American Board of Internal Medicine. We evaluated the diagnostic ability of 21 LLMs, using an LLM-as-a-judge approach (an automated, LLM-based evaluation) on MIMIC-IV patient records, which contain final diagnostic codes. For each case, a separate assessor LLM ("judge") compared the predictor LLM's diagnostic output to the true diagnoses from the patient record. The assessor determined whether each true diagnosis was inferable from the available data and, if so, whether it was correctly predicted ("hit") or not ("miss"). Diagnoses not inferable from the patient record were excluded from the hit rate analysis. The reported hit rate was defined as the number of hits divided by the total number of hits and misses. The statistical significance of the differences in model performance was assessed using a pooled z-test for proportions. Gemini 2.5 was the top performer with a hit rate of 97.4% (95% CI 97.0%-97.8%) as assessed by GPT-4.1, significantly outperforming GPT-4.1, Claude-4 Opus, and Claude Sonnet. However, GPT-4.1 ranked the highest in a separate set of experiments evaluated by GPT-4 Turbo, which tended to be less conservative than GPT-4.1 in its assessments. 
Significant variation in diagnostic hit rates was observed across different prompts, while changes in temperature generally had little effect. Finally, RAG significantly improved the hit rate of GPT-4o 05-13 by an average of 0.8% (P<.006). While the results are promising, more diverse datasets and hospital pilots, as well as close collaborations with physicians, are needed to obtain a better understanding of the diagnostic abilities of these models.
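The hit rate definition and the pooled z-test for proportions mentioned in the LLM abstract can be sketched as follows, using only the Python standard library. The counts are hypothetical and do not reproduce any result from the study. The function names are illustrative, and a two-sided test is assumed.

```python
import math


def hit_rate(hits: int, misses: int) -> float:
    """Hits divided by hits plus misses; non-inferable diagnoses are
    assumed to have been excluded before counting."""
    return hits / (hits + misses)


def pooled_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided pooled z-test for the difference of two proportions.
    Returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts for two models (illustration only).
rate = hit_rate(hits=974, misses=26)
z, p_value = pooled_z_test(x1=90, n1=100, x2=80, n2=100)
print(f"hit rate = {rate:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```

Because the same cases are scored for every model, a paired test might also be considered; the abstract reports only the pooled z-test, so that is what is shown here.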
Chaotic dynamics has been the subject of both theoretical and empirical research in epidemiology, with the most recent research strongly focusing on SARS-CoV-2. However, few empirical studies have been undertaken with respect to influenza, even though evidence of chaos has also been found in influenza surveillance data. Furthermore, empirical studies on chaos are focused on reconstructing hidden attractors in epidemiological time series to filter out noise; however, dynamical noise affecting chaotic dynamics can have relevant epidemiological features that are, in this way, left unresearched and that can be used for epidemiological surveillance and risk analysis by capturing the main underlying nonlinear processes associated with epidemiological dynamics. This study aimed to reinforce empirical research on chaotic dynamics in influenza surveillance and the study of the dynamical noise affecting that chaotic dynamics, addressing the consequences for epidemiological risk analysis and surveillance. Working with the weekly share of positive influenza tests for the Northern Hemisphere from January 2009 to March 2025 compiled by Our World in Data using FluNet data from the World Health Organization, we applied a recent method based on topological data analysis for reconstructing underlying attractors from time series and decomposing the dynamics down to independent and identically distributed noise. We adapted the method to the epidemiological context so that it can be used for predictive decomposition with direct application to epidemiological risk analysis and surveillance. We found evidence of a low-dimensional chaotic attractor in the researched surveillance data, with a fractal dimension between 1 and 2, and a positive statistically significant largest Lyapunov exponent. 
The chaotic dynamics had power law scaling associated with long-wave influenza outbreaks and was affected by a stochastic component that is nonstationary in variance, leading to turbulent bursts in the outbreak dynamics. Testing different machine learning algorithms using the attractor as input for prediction and a 10-week rolling window, we found the following largest R² scores for the prediction of the target series: 92.11% (1 week ahead), 85.95% (2 weeks ahead), 81.75% (3 weeks ahead), 77.59% (4 weeks ahead), and 73.35% (5 weeks ahead). The main results reinforce previous theoretical and empirical studies on chaos in epidemiology. Our findings showed that there is a 2-dimensional chaotic attractor that can support up to a 1-month prediction of the target surveillance series with high prediction scores and that the attractor plus noise can be modeled in a way that supports uncertainty quantification and epidemiological risk analysis.
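The prediction scores in the influenza abstract are reported as R² percentages. The sketch below shows how such a score can be computed, assuming the standard coefficient of determination (1 minus the ratio of residual to total sum of squares). The abstract does not state the exact formula used, and the data are hypothetical.

```python
def r2_score(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Assumes y_true is not constant (SS_tot > 0)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values (illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r2_score(observed, predicted)
print(f"R^2 = {100 * score:.2f}%")
```

A score of 100% would indicate perfect prediction, so the decline from 92.11% at 1 week to 73.35% at 5 weeks reflects growing prediction error at longer horizons.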