The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It comprises metadata for 145,517 preprints from two periods, 2016-2018 (pre-pandemic) and 2020-2022 (pandemic), retrieved via the bioRxiv and Crossref APIs. We selected the two periods to capture preprint-publication dynamics before and during the COVID-19 pandemic while avoiding transitional years. Each record includes bibliographic information such as titles, abstracts, authors, institutions, submission dates, licenses, and subject categories, alongside enriched publication metadata including journal names, publication dates, author lists, and further information. In addition to the main dataset, a version-history subset provides all available versions of preprints within the two selected periods, enabling analysis of how preprints evolve over time. Preprints are categorized into three groups: Published (formally linked to a journal article), Preprint Only (posted on a preprint server), and Gray Zone (potentially published in a journal but unlinked). To enhance reliability, title and author similarity scores w
Preprints have been considered primarily as a supplement to journal-based systems for the rapid dissemination of relevant scientific knowledge and have historically been supported by studies indicating that preprints and published reports have comparable authorship, references, and quality. However, as preprints increasingly serve as an independent medium for scholarly communication rather than precursors to the version of record, it remains uncertain how preprint usage is shaping scientific discourse. Our research revealed that preprint citations exhibit significantly higher inequality than journal citations, consistently across categories. This trend persisted even when controlling for age and the mean citation count of the journal matched to each of the preprint categories. We also found that the citation inequality in preprints is not solely driven by a few highly cited papers or those with no impact, but rather reflects a broader systemic effect. Whether the preprint is subsequently published in a journal or not does not significantly affect the citation inequality. Further analyses of the structural factors show that preferential attachment does not significantly contribut
This article deals with the early development of preprint communication in high-energy physics, specifically with how preprint communication was formalized in the early 1960s at the European Organization for Nuclear Research (CERN). It employs a sociological conception of infrastructures to ask which practices and technologies of communication structured the use of preprints in the field at the time and subsequently solidified into the research community's communication and information system. The text conducts an archaeology of the early preprint infrastructure along the lines of three systematic-historical explorations: (1) the use of preprints as media to privately and informally share practical instructions and theoretical tools in the fast-moving current of postwar theoretical physics; (2) the institutional and organizational context of library and documentation work at CERN around mid-century, which acted as a backdrop for creating the preprint infrastructure; and (3) the actual formalization of preprint communication into an information system at the CERN library in the early 1960s, which treated preprints as public "current awareness tools" for the benefit of the whole communi
Preprinting has become a norm in fast-paced computing fields such as artificial intelligence (AI) and human-computer interaction (HCI). In this paper, we conducted semistructured interviews with 15 academics in these fields to reveal their motivations for and perceptions of preprinting. The results reveal a close relationship between preprinting and characteristics of the fields, including the huge number of papers, competitiveness in career advancement, prevalence of scooping, and an imperfect peer review system; for the participants, preprinting comes to the rescue in one way or another. Based on the results, we reflect on the role of preprinting in subverting the traditional publication mode and outline possibilities of a better publication ecosystem. Our study contributes by inspecting the community aspects of preprinting practices through talking to academics.
The growing impact of preprint servers enables the rapid sharing of time-sensitive research. Likewise, it is becoming increasingly difficult to distinguish high-quality, peer-reviewed research from preprints. Although preprints are often later published in peer-reviewed journals, this information is often missing from preprint servers. To overcome this problem, the PreprintResolver was developed, which uses four literature databases (DBLP, SemanticScholar, OpenAlex, and CrossRef / CrossCite) to identify preprint-publication pairs for the arXiv preprint server. The target audience includes, but is not limited to, inexperienced researchers and students, especially from the field of computer science. The tool is based on a fuzzy matching of author surnames, titles, and DOIs. Experiments were performed on a sample of 1,000 arXiv-preprints from the research field of computer science and without any publication information. With 77.94 %, computer science is highly affected by missing publication information in arXiv. The results show that the PreprintResolver was able to resolve 603 out of 1,000 (60.3 %) arXiv-preprints from the research field of computer science and without any publica
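The fuzzy matching of titles and author surnames described above can be illustrated with a minimal sketch. It is not the PreprintResolver implementation: the function names, the record layout, and both thresholds are hypothetical, the standard-library `difflib.SequenceMatcher` stands in for whatever similarity measure the tool actually uses, and DOI matching is left out.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def surname_overlap(a: list[str], b: list[str]) -> float:
    """Jaccard overlap of the two case-insensitive surname sets."""
    set_a = {s.lower() for s in a}
    set_b = {s.lower() for s in b}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_match(preprint: dict, candidate: dict,
             title_threshold: float = 0.9,     # assumed value, not from the source
             author_threshold: float = 0.5) -> bool:  # assumed value
    """Accept a candidate only if both titles and surnames are close enough."""
    return (
        title_similarity(preprint["title"], candidate["title"]) >= title_threshold
        and surname_overlap(preprint["surnames"], candidate["surnames"]) >= author_threshold
    )


# Hypothetical records for illustration only.
preprint = {"title": "Attention Is All You Need", "surnames": ["Vaswani", "Shazeer"]}
candidate = {"title": "Attention is all you need.", "surnames": ["vaswani", "shazeer", "parmar"]}
matched = is_match(preprint, candidate)
```

Requiring both signals to pass is a deliberate choice in this sketch: a near-identical title alone is weak evidence when author lists do not overlap.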
Open science is increasingly recognised worldwide, with preprint posting emerging as a key strategy. This study explores the factors influencing researchers' adoption of preprint publication, particularly the perceived effectiveness of this practice and research intensity indicators such as publication and review frequency. Using open data from a comprehensive survey with 5,873 valid responses, we conducted regression analyses to control for demographic variables. Researchers' productivity, particularly the number of journal articles and books published, greatly influences the frequency of preprint deposits. The perception of the effectiveness of preprints follows this. Preprints are viewed positively in terms of early access to new research, but negatively in terms of early feedback. Demographic variables, such as gender and the type of organisation conducting the research, do not have a significant impact on the production of preprints when other factors are controlled for. However, the researcher's discipline, years of experience and geographical region generally have a moderate effect on the production of preprints. These findings highlight the motivations and barriers associat
The adoption of open science has quickly changed how artificial intelligence (AI) policy research is distributed globally. This study examines the regional trends in the citation of preprints, specifically focusing on the impact of two major disruptive events: the COVID-19 pandemic and the release of ChatGPT, on research dissemination patterns in the United States, Europe, and South Korea from 2015 to 2024. Using bibliometrics data from the Web of Science, this study tracks how global disruptive events influenced the adoption of preprints in AI policy research and how such shifts vary by region. By marking the timing of these disruptive events, the analysis reveals that while all regions experienced growth in preprint citations, the magnitude and trajectory of change varied significantly. The United States exhibited sharp, event-driven increases; Europe demonstrated institutional growth; and South Korea maintained consistent, linear growth in preprint adoption. These findings suggest that global disruptions may have accelerated preprint adoption, but the extent and trajectory are shaped by local research cultures, policy environments, and levels of open science maturity. This paper
Preprints, versions of scientific manuscripts that precede peer review, are growing in popularity. They offer an opportunity to democratize and accelerate research, as they involve neither publication costs nor a lengthy peer review process. Preprints are often later published in peer-reviewed venues, but these publications and the original preprints are frequently not linked in any way. To this end, we developed a tool, PreprintMatch, to find matches between preprints and their corresponding published papers, if they exist. This tool outperforms existing techniques to match preprints and papers, both on matching performance and speed. PreprintMatch was applied to search for matches between preprints (from bioRxiv and medRxiv) and PubMed. The preliminary nature of preprints offers a unique perspective into scientific projects at a relatively early stage, and with better matching between preprint and paper, we explored questions related to research inequity. We found that preprints from low-income countries are published as peer-reviewed papers at a lower rate than those from high-income countries (39.6% and 61.1%, respectively), and our data is consistent with previous work that cites a lack of reso
Preprints play an increasingly critical role in academic communities. There are many reasons driving researchers to post their manuscripts to preprint servers before formal submission to journals or conferences, but the use of preprints has also sparked considerable controversy, especially surrounding the claim of priority. In this paper, a case study of computer science preprints submitted to arXiv from 2008 to 2017 is conducted to quantify how many preprints have eventually been printed in peer-reviewed venues. Among those published manuscripts, some are published under different titles and without an update to their preprints on arXiv. In the case of these manuscripts, the traditional fuzzy matching method is incapable of mapping the preprint to the final published version. In view of this issue, we introduce a semantics-based mapping method with the employment of Bidirectional Encoder Representations from Transformers (BERT). With this new mapping method and a plurality of data sources, we find that 66% of all sampled preprints are published under unchanged titles and 11% are published under different titles and with other modifications. A further analysis was then performed to
This study of the literature on 'AI Policy' over the past decade found that citations of preprints (publications on platforms such as arXiv) have increased from five percent to forty percent across three major regions: the U.S., U.K. & E.U., and South Korea. We compare regional responses of preprint citations across the global disruptions of COVID-19 and the release of ChatGPT. We discuss driving factors and risks of preprint normalization, which follows the trend in computer science.
A preprint is a version of a scientific paper that is publicly distributed preceding formal peer review. Since the launch of arXiv in 1991, preprints have been increasingly distributed over the Internet as opposed to paper copies. Online distribution allows original research to be disseminated openly within a few days, often at a very low operating cost. This work overviews how preprints have evolved and affected the research community over the past thirty years alongside the growth of the Web. In this work, we first report that the number of preprints has grown exponentially, by a factor of 63 in 30 years, although preprints account for only 4% of research articles. Second, we quantify the benefits that preprints bring to authors: preprints reach an audience 14 months earlier on average and are associated with five times more citations than their non-preprint counterparts. Last, to address the quality concern of preprints, we discover that 41% of preprints are ultimately published at a peer-reviewed destination, and the published venues are as influential as papers without a preprint version. Additionally, we discuss the unprecedented role of preprints in communicating the latest research data d
The COVID-19 pandemic accelerated the use of preprints, aiding rapid research dissemination but also facilitating the spread of misinformation. This study analyzes media coverage of preprints from 2014 to 2023, revealing a significant post-pandemic decline. Our findings suggest that heightened awareness of the risks associated with preprints has led to more cautious media practices. While the decline in preprint coverage may mitigate concerns about premature media exposure, it also raises questions about the future role of preprints in science communication, especially during emergencies. Balanced policies based on up-to-date evidence are needed to address this shift.
Background. Systematic reviews in comparative effectiveness research require timely evidence synthesis. Preprints accelerate knowledge dissemination but vary in quality, posing challenges for systematic reviews. Methods. We propose AutoConfidence (automated confidence assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors, including three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for binary outcome and a survival cure model that predicts both binary outcome and publication risk over time. Results. The random forest classifier achieved AUROC 0.692 with LLM-driven scores, improving to 0.733 with semantic embeddings and 0.747 with article usage metrics. The survival cure model reached AUROC 0.716 with LLM-driven scores, improving to 0.731 with semantic embeddings. For publication risk prediction, it achieved a concordance index of 0.658, increasing to 0.667
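The concordance index reported above measures how well predicted risks order the observed times to publication. The sketch below computes Harrell's C-index in plain Python. The abstract does not say which variant or library AutoConfidence uses, so the choice of Harrell's formulation, the function name, and the toy data are all assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ordered correctly.

    times  -- observed time to publication or to censoring
    events -- 1 if publication was observed, 0 if the record is censored
    risks  -- predicted risk scores (higher means earlier expected publication)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored record cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data for illustration only: the third preprint is censored (not yet published).
times = [1, 2, 3]
events = [1, 1, 0]
perfect = concordance_index(times, events, [0.9, 0.5, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.5, 0.9])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported figures.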
In this study we analyse the key driving factors of preprints in enhancing scholarly communication. To this end we use four groups of metrics, one referring to scholarly communication and based on bibliometric indicators (Web of Science and Scopus citations), while the others reflect usage (usage counts in Web of Science), capture (Mendeley readers) and social media attention (Tweets). With these metrics we measure two effects associated with preprint publishing: publication delay and impact. We define and use several indicators to assess the impact of journal articles with previous preprint versions in arXiv. In particular, the indicators measure several time intervals characterizing the publishing of arXiv preprints and the reviewing process of the journal versions, and the ageing patterns of citations to preprints. In addition, we compare the observed patterns between preprints and non-OA articles without any previous preprint versions in arXiv. We could observe that the "early-view" and "open-access" effects of preprints contribute to a measurable citation and readership advantage of preprints. Articles with preprint versions are more likely to be mentioned in social media and have shorter
We present CiteAssist, a system to automate the generation of BibTeX entries for preprints, streamlining the process of bibliographic annotation. Our system extracts metadata, such as author names, titles, publication dates, and keywords, to create standardized annotations within the document. CiteAssist automatically attaches the BibTeX citation to the end of a PDF and links it on the first page of the document so other researchers gain immediate access to the correct citation of the article. This method promotes platform flexibility by ensuring that annotations remain accessible regardless of the repository used to publish or access the preprint. The annotations remain available even if the preprint is viewed externally to CiteAssist. Additionally, the system adds relevant related papers based on extracted keywords to the preprint, providing researchers with additional publications besides those in related work for further reading. Researchers can enhance their preprints organization and reference management workflows through a free and publicly available web interface.
Citing is an important aspect of scientific discourse and is important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bias is produced. Existing works, however, have not analyzed preprints with respect to citation bias, although they play an increasingly important role in modern scholarly communication. In this paper, we investigate whether preprints are affected by citation bias with respect to the author affiliation. We measure citation bias for bioRxiv preprints and their publisher versions at the institution level and country level, using the Lorenz curve and Gini coefficient. This allows us to mitigate the effects of confounding factors and see whether or not citation biases related to author affiliation have an increased effect on preprint citations. We observe consistently higher Gini coefficients for preprints than those for publisher versions. Thus, we can confirm that citation bias exists and that it is more severe in the case of preprints. As preprint
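The Lorenz curve and Gini coefficient used above can be computed from a list of citation counts, for example one aggregated count per institution or per country. The following is a minimal sketch using the standard rank-weighted formula for the Gini coefficient. The function names and the sample counts are hypothetical, and the aggregation step of the study is not reproduced.

```python
def lorenz_curve(values):
    """Return (population share, cumulative citation share) points."""
    xs = sorted(values)
    total = sum(xs)
    points = [(0.0, 0.0)]
    if not xs or total == 0:
        return points
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / len(xs), running / total))
    return points


def gini(values):
    """Gini coefficient: 0 for perfect equality, (n - 1) / n at the maximum."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, ranks i = 1..n ascending
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical citation counts per institution, for illustration only.
equal_counts = [1, 1, 1, 1]
concentrated_counts = [0, 0, 0, 10]
g_equal = gini(equal_counts)
g_concentrated = gini(concentrated_counts)
```

Computing the coefficient once for preprint citations and once for the citations of the matched publisher versions would give the kind of paired comparison the abstract describes.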
This paper quantifies to what extent preprints in arXiv accelerate scholarly communication. The following subject fields were investigated up to the year 2012: High Energy Physics (HEP), Mathematics, Astrophysics, Quantitative Biology, and Library and Information Science (LIS). Publication and citation data were downloaded from Scopus and matched with corresponding preprints in arXiv. Furthermore, the INSPIRE HEP database was used to retrieve citation data for papers related to HEP. The bibliometric analysis deals with the growth in the number of published articles that have a previous preprint in arXiv and with the publication delay, which is defined as the chronological distance between the deposit of a preprint in arXiv and its formal journal publication. Likewise, the citation delay is analyzed, which describes the time until the first citation of preprints and articles, respectively. Total citation numbers are compared for sets of articles with a previous preprint and those without. The results show that in all fields but biology a significant citation advantage exists in terms of speed and citation rates for articles with a previous preprint version on arXiv.
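The two delay measures defined above reduce to date differences. The sketch below shows one way to compute them, assuming each record carries a deposit date, a journal publication date, and the date of the first citation. The field names and sample dates are hypothetical, and measuring in days is an assumption, since the abstract does not state the unit.

```python
from datetime import date
from statistics import median


def publication_delay_days(deposited: date, published: date) -> int:
    """Days between arXiv deposit and formal journal publication."""
    return (published - deposited).days


def citation_delay_days(available: date, first_cited: date) -> int:
    """Days between a document becoming available and its first citation."""
    return (first_cited - available).days


# Hypothetical records for illustration only.
records = [
    {"deposited": date(2011, 3, 1), "published": date(2011, 9, 1), "first_cited": date(2011, 5, 1)},
    {"deposited": date(2010, 1, 15), "published": date(2010, 3, 16), "first_cited": date(2010, 6, 15)},
    {"deposited": date(2012, 6, 1), "published": date(2013, 6, 1), "first_cited": date(2012, 7, 1)},
]

pub_delays = [publication_delay_days(r["deposited"], r["published"]) for r in records]
cit_delays = [citation_delay_days(r["deposited"], r["first_cited"]) for r in records]
median_pub_delay = median(pub_delays)
```

The median is used here instead of the mean because delay distributions tend to be skewed by a few very late publications; that choice belongs to this sketch and is not stated in the abstract.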
We have used data from ADS, AAS, and astro-ph, to study the publishing, preprint posting, and citation patterns for papers published in the ApJ in 1999 and 2002. This allowed us to track statistical trends in author demographics, preprint posting habits, and citation rates for ApJ papers as a whole and across various subgroups and types of ApJ papers. The most interesting results are the frequencies of use of the astro-ph server across various subdisciplines of astronomy, and the impact that such posting has on the citation history of the subsequent ApJ papers. By 2002 72% of ApJ papers were posted as astro-ph preprints, but this fraction varies from 22-95% among the subfields studied. A majority of these preprints (61%) were posted after the papers were accepted at ApJ, and 88% were posted or updated after acceptance. On average, ApJ papers posted on astro-ph are cited more than twice as often as those that are not posted on astro-ph. This difference can account for a number of other, secondary citation trends, including some of the differences in citation rates between journals and different subdisciplines. Preprints clearly have supplanted the journals as the primary means for i
Although there was an early experiment in the 1960s with the central distribution of paper preprints in the biomedical sciences, these sciences have not been early adopters of electronic preprint servers. Some barriers to the development of a 'preprint culture' in the biomedical sciences are described. Multiple factors that, from the 1960s, fostered the transition from a paper-based preprint culture in high energy physics to an electronic one are also described. A new revolution in scientific publishing, in which journals come to be regarded as an overlay on electronic preprint databases, will probably overtake some areas of research much more quickly than others.
We analyze the online response to the preprint publication of a cohort of 4,606 scientific articles submitted to the preprint database arXiv.org between October 2010 and May 2011. We study three forms of responses to these preprints: downloads on the arXiv.org site, mentions on the social media site Twitter, and early citations in the scholarly record. We perform two analyses. First, we analyze the delay and time span of article downloads and Twitter mentions following submission, to understand the temporal configuration of these reactions and whether one precedes or follows the other. Second, we run regression and correlation tests to investigate the relationship between Twitter mentions, arXiv downloads and article citations. We find that Twitter mentions and arXiv downloads of scholarly articles follow two distinct temporal patterns of activity, with Twitter mentions having shorter delays and narrower time spans than arXiv downloads. We also find that the volume of Twitter mentions is statistically correlated with arXiv downloads and early citations just months after the publication of a preprint, with a possible bias that favors highly mentioned articles.