Large language models (LLMs) have diffused rapidly into academic writing since late 2022. Using the complete population of 109,393 research articles published in \textit{PLOS ONE} between 2019 and 2025, we examine population-level structural publication indicators, including full-text manuscript length, authorship team size, reference volume, and cross-linguistic collaboration, before and after 2022. \textit{PLOS ONE}'s multidisciplinary scope and consistent editorial framework allow cross-field comparison under uniform conditions over an extended period. Manuscript length increased substantially, with gains ranging from 14.8\% among African-affiliated authors and 11.7\% among Asian-affiliated authors to 5.3\% among native English-speaking (NES) authors, cutting the word-count gap by 39\%. More strikingly, non-native English-speaking (NNES) authors reduced both authorship team size, from 6.54 to 6.06 authors, or 7.3\%, and collaboration with NES co-authors, from 17.8\% to 12.2\%, or 36\%, while NES authors remained stable in both team size and collaboration rates. Reference counts increased modestly and uniformly across groups. These findings suggest that post-2022 tools may be res
Population genetics lies at the heart of evolutionary theory. This topic forms part of many biological science curricula but is rarely taught to physics students. Since physicists are becoming increasingly interested in biological evolution, we aim to provide a brief introduction to population genetics, written for physicists. We start with two background chapters: chapter 1 provides a brief historical introduction to the topic, while chapter 2 provides some essential biological background. We begin our main content with chapter 3, which discusses the key concepts behind Darwinian natural selection and Mendelian inheritance. Chapter 4 covers the basics of how variation is maintained in populations, while chapter 5 discusses mutation and selection. In chapter 6 we discuss stochastic effects in population genetics using the Wright-Fisher model as our example, and finally we offer concluding thoughts and references to excellent textbooks in chapter 7.
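As a rough illustration of the stochastic effects mentioned for chapter 6, the Wright-Fisher model can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the text's own treatment: it assumes a single biallelic locus, a fixed population of `pop_size` gene copies, and no mutation or selection. The function and parameter names (`wright_fisher`, `pop_size`, `init_freq`) are hypothetical.

```python
import random


def wright_fisher(pop_size, init_freq, generations, seed=None):
    """Simulate neutral Wright-Fisher drift at one biallelic locus.

    Assumptions (illustrative only): constant population of `pop_size`
    gene copies, non-overlapping generations, no mutation, no selection.
    Returns the allele-A frequency for each generation, starting value included.
    """
    rng = random.Random(seed)
    count = round(init_freq * pop_size)
    freqs = [count / pop_size]
    for _ in range(generations):
        p = count / pop_size
        # Binomial sampling: every gene copy in the next generation
        # independently inherits allele A with probability p.
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
        freqs.append(count / pop_size)
    return freqs


trajectory = wright_fisher(pop_size=100, init_freq=0.5, generations=200, seed=1)
```

Frequencies 0 and 1 are absorbing in this sketch: once an allele is lost or fixed, drift alone cannot restore variation.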
In this, the first of an anticipated four-paper series, fundamental results of quantitative genetics are presented from a first-principles approach. While none of these results are in any sense new, they are presented in extended detail to precisely distinguish between definition and assumption, with a further emphasis on distinguishing quantities from their usual approximations. Terminology frequently encountered in the field of human genetic disease studies will be defined in terms of its quantitative genetics form. Methods for estimation of both quantitative genetics and the related human genetics quantities will be demonstrated. While practitioners in the field of human quantitative disease studies may find this work pedantic in detail, the principal target audience for this work is trainees reasonably familiar with population genetics theory, but with less experience in its application to human disease studies. We introduce much of this formalism because in later papers in this series, we demonstrate that common areas of confusion in human disease studies can be resolved by appealing directly to these formal definitions. The second paper in this series will discuss polygenic ri
This study aims to evaluate the accuracy of authorship attributions in scientific publications, focusing on the fairness and precision of individual contributions within academic works. The study analyzes 81,823 publications from the journal PLOS ONE, covering the period from January 2018 to June 2023. It examines the authorship attributions within these publications to determine the prevalence of inappropriate authorship. It also investigates the demographic and professional profiles of affected authors, exploring trends and potential factors contributing to inaccuracies in authorship. Surprisingly, 9.14% of articles feature at least one author with inappropriate authorship, affecting over 14,000 individuals (2.56% of the sample). Inappropriate authorship is more concentrated in Asia, Africa, and specific European countries like Italy. Established researchers with significant publication records and those affiliated with companies or nonprofits show higher instances of potential monetary authorship. Our findings are based on contributions as declared by the authors, which implies a degree of trust in their transparency. However, this reliance on self-reporting may introduc
Imaging genetics is a growing field that employs structural or functional neuroimaging techniques to study individuals with genetic risk variants potentially linked to specific illnesses. This area presents considerable challenges to statisticians due to the heterogeneous information and different data forms it involves. In addition, both imaging and genetic data are typically high-dimensional, creating a "big data squared" problem. Moreover, brain imaging data contains extensive spatial information. Simply vectorizing tensor images and treating voxels as independent features can lead to computational issues and disregard spatial structure. This paper presents a novel statistical method for imaging genetics modeling while addressing all these challenges. We explore a Canonical Correlation Analysis based linear model for the joint modeling of brain imaging, genetic information, and clinical phenotype, enabling the simultaneous detection of significant brain regions and selection of important genetic variants associated with the phenotype outcome. Scalable algorithms are developed to tackle the "big data squared" issue. We apply the proposed method to explore the reaction speed, an i
Contributorship statements have been effective at recording granular author contributions in research articles and have been broadly used to understand how labor is divided across research teams. However, one major limitation of existing empirical studies is that two classification systems have been adopted, notably by their most important data source, the journals published by the Public Library of Science (PLoS). This research aims to address this limitation by developing a mapping scheme between the two systems and using it to understand whether there are differences in the assignment of contribution by authors under the two systems. We use all research articles published in PLoS ONE between 2012 and 2020, divided into two five-year publication windows centered on the shift of the classification systems in 2016. Our results show that most tasks (except for writing- and resource-related tasks) are used similarly under the two systems. Moreover, notable differences in how researchers used the two systems are also examined and discussed. This research offers an important foundation for empirical research on division of labor in the future, by enabling a larger dataset that cross
This study proposes a quantitative framework to enhance curriculum coherence through the systematic alignment of Course Learning Outcomes (CLOs) and Program Learning Outcomes (PLOs), contributing to continuous improvement in outcome-based education. Grounded in accreditation standards such as ABET and NCAAA, the model introduces mathematical tools that map exercises, assessment questions, teaching units (TUs), and student assessment components (SACs) to CLOs and PLOs. This dual-layer approach, combining micro-level analysis of assessment elements with macro-level curriculum evaluation, enables detailed tracking of learning outcomes and helps identify misalignments between instructional delivery, assessment strategies, and program objectives. The framework incorporates alignment matrices, weighted relationships, and practical indicators to quantify coherence and evaluate course or program performance. Application of this model reveals gaps in outcome coverage and underscores the importance of realignment, especially when specific PLOs are underrepresented or CLOs are not adequately supported by assessments. The proposed model is practical, adaptable, and scalable, making it suitable f
The problem of inferring unknown graph edges from numerical data at a graph's nodes appears in many forms across machine learning. We study a version of this problem that arises in the field of \emph{landscape genetics}, where genetic similarity between organisms living in a heterogeneous landscape is explained by a weighted graph that encodes the ease of dispersal through that landscape. Our main contribution is an efficient algorithm for \emph{inverse landscape genetics}, which is the task of inferring this graph from measurements of genetic similarity at different locations (graph nodes). Inverse landscape genetics is important in discovering impediments to species dispersal that threaten biodiversity and long-term species survival. In particular, it is widely used to study the effects of climate change and human development. Drawing on influential work that models organism dispersal using graph \emph{effective resistances} (McRae 2006), we reduce the inverse landscape genetics problem to that of inferring graph edges from noisy measurements of these resistances, which can be obtained from genetic similarity data. Building on the NeurIPS 2018 work of Hoskins et al. 2018 on learn
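To make the notion of graph effective resistance concrete, the forward computation (graph to resistance) can be sketched as follows. This is only an illustration of the quantity the abstract refers to, not the inference algorithm the paper proposes. It assumes a connected graph whose edge weights are conductances, and the function name `effective_resistance` is hypothetical. The sketch grounds one node and solves the reduced Laplacian system by Gaussian elimination.

```python
def effective_resistance(n_nodes, weighted_edges, u, v):
    """Effective resistance between nodes u and v of a connected graph.

    weighted_edges: iterable of (i, j, conductance), conductance > 0.
    Illustrative sketch only; assumes the graph is connected.
    """
    if u == v:
        return 0.0
    # Build the graph Laplacian L = D - W.
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, w in weighted_edges:
        lap[i][i] += w
        lap[j][j] += w
        lap[i][j] -= w
        lap[j][i] -= w
    # Ground node v: dropping its row and column leaves an invertible system.
    keep = [k for k in range(n_nodes) if k != v]
    a = [[lap[r][c] for c in keep] for r in keep]
    b = [1.0 if r == u else 0.0 for r in keep]  # unit current injected at u
    m = len(keep)
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution gives node potentials; the potential at u is R(u, v).
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(a[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (b[r] - s) / a[r][r]
    return x[keep.index(u)]


path = [(0, 1, 1.0), (1, 2, 1.0)]                   # two unit resistors in series
triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]  # unit-weight triangle
r_path = effective_resistance(3, path, 0, 2)
r_triangle = effective_resistance(3, triangle, 0, 2)
```

On the path, the two unit resistors add in series. On the triangle, the direct edge sits in parallel with the two-edge detour, so the resistance drops below one. The inverse problem described in the abstract runs the other way, recovering edge weights from noisy values of this kind.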
As the importance of research data gradually grows in the sciences, data sharing has come to be encouraged and even mandated by journals and funders in recent years. Following this trend, the data availability statement has been increasingly embraced by academic communities as a means of sharing research data as part of research articles. This paper presents a quantitative study of which mechanisms and repositories are used to share research data in PLOS ONE articles. We offer a dynamic examination of this topic from the disciplinary and temporal perspectives based on all statements in English-language research articles published between 2014 and 2020 in the journal. We find a slow yet steady growth in the use of data repositories to share data over time, as opposed to sharing data in the paper or supplementary materials; this indicates improved compliance with the journal's data sharing policies. We also find that multidisciplinary data repositories have been increasingly used over time, whereas some disciplinary repositories show a decreasing trend. Our findings can help academic publishers and funders to improve their data sharing policies and serve as an important baseline dataset
A common sample descriptor in human genomics studies is that of 'genetic ancestry group', with terms such as 'European genetic ancestry' or 'East Asian genetic ancestry' frequently used in publications to describe the genetics of groups of individuals based on the analysis of their genotypes. In this Perspective, I argue that these terms are imprecise and potentially misleading and that, for most applications, simple statements of genetic similarity represent a more accurate description.
PLOS and Mozilla conducted a month-long pilot study in which professional developers performed code reviews on software associated with papers published in PLOS Computational Biology. While the developers felt the reviews were limited by (a) lack of familiarity with the domain and (b) lack of two-way contact with authors, the scientists appreciated the reviews, and both sides were enthusiastic about repeating the experiment.
We analyzed the longitudinal activity of nearly 7,000 editors at the mega-journal PLOS ONE over the 10-year period 2006-2015. Using the article-editor associations, we develop editor-specific measures of power, activity, article acceptance time, citation impact, and editorial remuneration (an analogue to self-citation). We observe remarkably high levels of power inequality among the PLOS ONE editors, with the top-10 editors responsible for 3,366 articles -- corresponding to 2.4% of the 141,986 articles we analyzed. Such high inequality levels suggest the presence of unintended incentives, which may reinforce unethical behavior in the form of decision-level biases at the editorial level. Our results indicate that editors may become apathetic in judging the quality of articles and susceptible to modes of power-driven misconduct. We used the longitudinal dimension of editor activity to develop two panel regression models which test and verify the presence of editor-level bias. In the first model we analyzed the citation impact of articles, and in the second model we modeled the decision time between an article being submitted and ultimately accepted by the editor. We focused on two va
It is widely accepted that population genetics theory is the cornerstone of evolutionary analyses. Empirical tests of the theory, however, are challenging because of the complex relationships between space, dispersal, and evolution. Critically, we lack quantitative validation of the spatial models of population genetics. Here we combine analytics, on- and off-lattice simulations, and experiments with bacteria to perform quantitative tests of the theory. We study two bacterial species, the gut microbe Escherichia coli and the opportunistic pathogen Pseudomonas aeruginosa, and show that spatio-genetic patterns in colony biofilms of both species are accurately described by an extension of the one-dimensional stepping-stone model. We use one empirical measure, genetic diversity at the colony periphery, to parameterize our models and show that we can then accurately predict another key variable: the degree of short-range cell migration along an edge. Moreover, the model allows us to estimate other key parameters including effective population size (density) at the expansion frontier. While our experimental system is a simplification of natural microbial communities, we argue it is a proof
Rare diseases are collectively common, affecting approximately one in twenty individuals worldwide. In recent years, rapid progress has been made in rare disease diagnostics due to advances in DNA sequencing, development of new computational and experimental approaches to prioritize genes and genetic variants, and increased global exchange of clinical and genetic data. However, more than half of individuals suspected to have a rare disease lack a genetic diagnosis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was initiated to study thousands of challenging rare disease cases and families and apply, standardize, and evaluate emerging genomics technologies and analytics to accelerate their adoption in clinical practice. Further, all data generated, currently representing ~7500 individuals from ~3000 families, is rapidly made available to researchers worldwide via the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) to catalyze global efforts to develop approaches for genetic diagnoses in rare diseases (https://gregorconsortium.org/data). The majority of these families have undergone prior clinical genetic testing
In this paper, we propose a framework for automatic classification of patients from multimodal genetic and brain imaging data by optimally combining them. Additive models with unadapted penalties (such as the classical group lasso penalty or $L_1$-multiple kernel learning) treat all modalities in the same manner and can result in undesirable elimination of specific modalities when their contributions are unbalanced. To overcome this limitation, we introduce a multilevel model that combines imaging and genetics and that considers joint effects between these two modalities for diagnosis prediction. Furthermore, we propose a framework that allows several penalties to be combined, taking into account the structure of the different types of data, such as a group lasso penalty over the genetic modality and an $L_2$-penalty on imaging modalities. Finally, we propose a fast optimization algorithm based on a proximal gradient method. The model has been evaluated on genetic (single nucleotide polymorphisms-SNP) and imaging (anatomical MRI measures) data from the ADNI database, and compared to additive models. It exhibits good performance in AD diagnosis and, at the same time, reveals relationships
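For readers unfamiliar with proximal gradient methods, the step that handles a group lasso penalty is block soft-thresholding. The sketch below shows that single operator in isolation. It is a generic textbook construction, not the paper's optimization algorithm, and the names `prox_group_lasso`, `groups`, and `threshold` are hypothetical.

```python
import math


def prox_group_lasso(vector, groups, threshold):
    """Proximal operator of threshold * sum_g ||v_g||_2 (block soft-thresholding).

    vector: list of floats; groups: list of index lists that partition it.
    Illustrative sketch only: each group is shrunk toward zero as a block,
    and is set exactly to zero when its norm does not exceed `threshold`.
    """
    out = list(vector)
    for idx in groups:
        norm = math.sqrt(sum(vector[i] ** 2 for i in idx))
        # Shrink the whole block by a common factor, or zero it out.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * vector[i]
    return out


shrunk = prox_group_lasso([3.0, 4.0, 0.1, 0.1], groups=[[0, 1], [2, 3]], threshold=1.0)
```

In the usage line, the first group has norm 5 and is scaled by 0.8, while the second group has norm below the threshold and is removed entirely. This block-wise removal is the mechanism by which a group lasso penalty can eliminate whole sets of variables at once.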
In this article, we describe highly cited publications in a PLOS ONE full-text corpus. For these publications, we analyse the citation contexts concerning their position in the text and their age at the time of citing. By selecting the perspective of highly cited papers, we can distinguish them based on the context during citation even if we do not have any other information source or metrics. We describe the top cited references based on how, when and in which context they are cited. The focus of this study is on a time perspective to explain the nature of the reception of highly cited papers. We have found that these references are distinguishable by the IMRaD sections in which they are cited. Furthermore, we show that the section usage of highly cited papers is time-dependent: the longer the citation interval, the higher the probability that a reference is cited in a method section.
In order to capture the effects of social ties in knowledge diffusion, this paper examines the publication network that emerges from the collaboration of researchers, using citation information as a means to estimate knowledge flow. For this purpose, we analyzed the papers published in the journal PLOS ONE and found strong evidence that the closer two authors are in the co-authorship network, the larger the probability that knowledge flow will occur between them. Moreover, we also found that when it comes to knowledge diffusion, strong co-authorship proximity is a stronger determinant than geographic proximity.
Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project.
The particular day of the week when an event occurs seems to have unexpected consequences. For example, the day of the week when a paper is submitted to a peer-reviewed journal correlates with whether that paper is accepted. Using an econometric analysis (a mix of log-log and semi-log based on undated and panel-structured data) we find that more papers are submitted to certain peer-reviewed journals on particular weekdays than others, with fewer papers being submitted on weekends. Seasonal effects, geographical information as well as potential changes over time are examined. This finding rests on a large (178 000) and reliable sample; the journals polled are broadly recognized (Nature, Cell, PLOS ONE and Physica A). The day-of-the-week effect in the submission of accepted papers should be of interest to many researchers, editors and publishers, and perhaps also to managers and psychologists.
The article continues an analysis of the genetic 8-dimensional Yin-Yang-algebra. This algebra was revealed in the course of matrix studies of the structure of the genetic code and was described in the author's articles arXiv:0803.3330 and arXiv:0805.4692. The article presents data about many kinds of cyclic permutations of elements of the genetic code in the genetic (8x8)-matrix [C A; U G](3) of 64 triplets, where C, A, U, G are letters of the genetic alphabet. These cyclic permutations lead to such reorganizations of the matrix form of presentation of the initial genetic Yin-Yang-algebra that the resulting matrices serve as matrix forms of presentation of new Yin-Yang-algebras as well. They are connected algorithmically with Hadamard matrices. The discovered existence of a hierarchy of the cyclic changes of types of genetic Yin-Yang-algebras allows thinking about new algebraic-genetic models of cyclic processes in inherited biological systems including models of cyclic metamorphoses of animals. These cycles of changes of the genetic 8-dimensional algebras and of their 8-dimensional numeric systems have many analogies with famous facts and doctrines of modern and ancient physiology, med