Abstracts are considered an essential part of academic publications and, besides the title, they are often the only part of the text that is actually read during a literature search. The purpose of an abstract is to provide a short description of the content of the text, like a mini-version of the article itself, giving readers all the elements they need to decide whether the article is of interest to them and whether it meets their needs (Cross & Oppenheim, 2006). It would be unthinkable, in today's setting of information overload (Landhuis, 2016), to expect readers to go through a whole article (a process that may take hours when done accurately) just to understand whether they can skip it and move on to the next paper in a list that may comprise thousands of items. Abstracts are a necessity, and thus often follow a standard structure to improve their readability. In the life sciences, they usually comprise an introduction/background section, materials and methods, a results section and a conclusion section, just like a small replica of a study report (Atanassova et al., 2016). In medicine, where clinical trials often need to be selected for meta-analyses that can then drive guidelines and decision-making processes, abstracts can even comprise finer-grained sections, such as study design or population of interest (Bahadoran et al., 2020). The usefulness of an abstract is exerted and exhausted mostly at the search level. Once a prospective reader has perused it and deemed the article of interest, the focus of attention shifts to the main text, and the abstract has accomplished its function. It is like a tag that allows readers to choose within the vast scientific literature (Alspach, 2017), and it can be considered part of the metadata of an article.
The adoption of modern information technologies is even prompting researchers to devise novel approaches and algorithms to screen and skim through abstracts, to make literature searches quicker and better (Afantenos et al., 2005; Hersh, 2021). Abstracts are such a consistent feature of academic publications that it is hard to imagine an article, or a journal, without them. Their consistent presence, however, is a relatively recent feature of scientific papers, as abstracts were notably absent from many journals published in the 19th century and the first part of the 20th century (Galli et al., 2020). As the function of abstracts lies mainly in literature searches, they evolved in step with the increase in the number of scientific publications. The most likely precursor of the abstract is the summary section, which could frequently be found at the end of the article to sum up what had just been presented in the body of the text and to distil its content into a few lines, presumably with the goal of creating an information structure that made the article easier to remember (Vaughan, 1991). This was reflected in a different structure: summaries lacked a materials and methods section, because they did not need to anticipate the experimental details of a study, and they often had a list structure that included the main take-home messages of the study. At a certain point in time, most journals switched from a post-textual summary to a pre-textual abstract, if they did not already possess one. This formal change reflected a turning point in the readers' relation to scientific literature, as journals adapted to a new format that appeared to respond better to the new needs of their readership. Each journal performed the transition at a different moment but, by the end of the 20th century, virtually all journals in the biomedical fields had abstracts. Some journals, including the Journal of Experimental Medicine (JEM), marked that transition quite visibly.
JEM radically changed its format in the first issue of Volume 172 in 1990 (Galli et al., 2020). A short piece by the then editor-in-chief, M. McCarty, explained how the journal's new look responded better to contemporaneous aesthetics. The changes included placing abstracts at the beginning of each article, which was described as a ‘touch of modernization’ (sic), although the editor admitted that this choice was also supported by the fact that readers were ‘now accustomed to look for it in most publications’ (McCarty, 1990). So, in the case of this journal, we possess a safe terminus post quem to date the appearance of abstracts. The purpose of this study is to use quantitative methods to assess stylistic changes in summaries/abstracts over the course of time. The working hypothesis is that the actual transition from a real summary, in the original sense of the word, to a new form of abstract, for indexing and retrieval purposes, did not happen for JEM overnight in July 1990, when the editor decided to reposition it at the beginning of the article, but can be traced back to the years preceding 1990. We believe it is important to investigate whether the editor's decision anticipated or followed a change in the text that had already started independently of his intervention, owing to cultural changes in the use of scientific literature that were already running through the scientific community, and how far in the past such changes had emerged. Such an investigation can serve as a case study for the changes in the way we interface with scientific literature that are still occurring and that may require journal structure to pivot again in the future.
Secondarily, as summaries primarily served the cognitive role of helping readers understand and memorize the content of a study, it could be useful to investigate whether certain characteristics of summaries that were lost as they morphed into abstracts may still be useful in abstracts to improve article visibility, an ever-difficult endeavour in today's scientific world.
This study analysed the corpus of abstracts of all the articles that have appeared in JEM since its foundation. To generate this corpus, the Python litter-getter library was used, and a Medline search was carried out through the PubMed API using the search term ‘The Journal of experimental medicine [Journal]’. This retrieved an XML file for each indexed article, which was then used to create a pandas DataFrame (McKinney, 2010) by means of the BeautifulSoup library. The data extracted for each article were ‘PMID’, ‘Title’, ‘Abstract’, ‘Year’, ‘Volume’, ‘Issue’, ‘Type’ (meaning the type of item, i.e. article, review, commentary, etc.), ‘Authors’, ‘Affiliation’ and ‘Country’. For the embeddings analysis, the text of the abstracts was pre-processed by lowercasing it, removing stop-words using the Gensim library (Řehůřek & Sojka, 2010), and removing punctuation and numbers. The text was then passed into the spaCy library (Honnibal & Montani, 2017), using the large English vocabulary, to obtain embeddings, and principal component analysis (scikit-learn implementation) was used for dimensionality reduction. The lowercased text of the abstracts (without further pre-processing) was also passed into the proprietary software Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2015) for further analysis. LIWC is based on a series of user-defined dictionaries (Pennebaker et al., 2003), which are used to define scoring variables associated with specific thematic spheres (Donohue et al., 2014). Most of these scores are based on how well the sample text matches a specific dictionary.
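To make the idea of dictionary-based scoring concrete, the following is a minimal sketch and not LIWC's actual implementation, which is proprietary. It assumes that a category score is simply the percentage of words in a text that appear in that category's dictionary. The word lists in `TOY_DICTIONARIES` and the function name `category_scores` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy dictionaries; LIWC's real dictionaries are proprietary
# and far larger. These word lists are illustrative assumptions only.
TOY_DICTIONARIES = {
    "first_person_plural": {"we", "our", "us", "ours"},
    "past_focus": {"was", "were", "had", "did", "found"},
}


def category_scores(text, dictionaries=TOY_DICTIONARIES):
    """Return, for each category, the percentage of words in `text`
    that belong to that category's dictionary."""
    # Lowercase the text and keep alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {name: 0.0 for name in dictionaries}
    return {
        name: 100.0 * sum(word in vocab for word in words) / len(words)
        for name, vocab in dictionaries.items()
    }


sample = "We found that our cells were viable."
scores = category_scores(sample)
```

Because the score is a relative frequency, texts of different lengths can be compared directly, which matters for a corpus in which abstract length varies across decades.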
LIWC 2015 also includes four non-transparent summary variables, namely Analytical Thinking, Clout, Authenticity, and Emotional Tone, which are based on Pennebaker's previously published research on text corpora (Pennebaker et al., 2015; Tausczik & Pennebaker, 2010). As the name implies, Analytical Thinking refers to the use of logical and consistent language, while Clout is associated with a self-confident and authoritative attitude (Kacewicz et al., 2014). A high Authenticity score indicates unhedged, non-detached language, while the Emotional Tone score refers to the quality of the emotions that permeate the text. LIWC also analyzes the parts of speech of the texts and quantifies, among others, the use of personal pronouns, adjectives, verbs, and so forth (as relative frequencies). The software also infers the time focus of a text, based on the use of temporal expressions. The Matplotlib (Hunter, 2007) and Seaborn (Waskom, 2021) libraries were then used to plot the data. All analyses were conducted in Jupyter notebooks (Kluyver et al., 2016).
Our PubMed search retrieved 24,012 articles published in JEM, from 1896, when the first issue was published at the Johns Hopkins University, to the present day. The number of published items steadily increased over the decades and peaked around the 1990s (Fig. 1). Notably, fewer articles appeared in the following decades, which may reflect the fortunes of the journal on the market or its popularity in the scientific community. Its non-exponential growth, however, partially balances out the distribution of the articles over the years and may prove advantageous for further analysis. Only 969 articles did not have a summary or an abstract (Fig. 2). These were spread across the decades, although most appeared either before the 1920s or after the 1990s.
In the case of early studies, the lack of a summary is most likely due to a lack of standardization in the reporting format, as these articles do not otherwise visibly differ from the articles with a summary. The most recent publications without abstracts, by contrast, fall within the editorial genre or are simply corrections of previously published works (data not shown). These items were not considered in the subsequent analysis.
Our first overview of the data aimed at assessing whether a semantic change in summaries/abstracts could be detected using standard NLP approaches. To do that, we obtained embeddings, that is, dense vector representations of the texts, using the free spaCy library. A vector is an ordered array of real numbers, a convenient mathematical notation that is commonly used in several NLP technologies to represent words or even sentences. Vector semantics rests on the distributional semantics hypothesis, that is, that the meanings of words that co-occur frequently in a corpus of texts are likely to be associated. It therefore attributes similar vectors to co-occurring words by means of complex algorithms, which may include neural networks (Konstantinov et al., 2021). A whole sentence can also be represented by a vector, most often the mean of the vectors of the individual words making up the sentence. The spaCy library generated a vector of length 300 (i.e. an array of 300 real numbers) for each summary/abstract, which we then plotted after dimensionality reduction by principal component analysis (PCA; Fig. 3). This procedure ‘compressed’ each 300-dimensional vector into a 2-dimensional tuple, which was used as a pair of x, y coordinates to plot each text in a 2D graph. Appendix Table A1 shows an example of an abstract before and after pre-processing and its corresponding embedding, together with the result of the PCA dimensionality reduction.
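The two steps described above, averaging word vectors into one vector per text and reducing those vectors to two dimensions, can be sketched as follows. This is an illustration under stated assumptions and not the pipeline used in the study: the 4-dimensional `TOY_VECTORS` are hypothetical stand-ins for the 300-dimensional vectors of a pretrained model, the function names are invented for this example, and PCA is computed with NumPy's singular value decomposition in place of the scikit-learn implementation named in the text.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors standing in for the
# 300-dimensional vectors of a pretrained model (assumption for brevity).
TOY_VECTORS = {
    "serum": np.array([0.9, 0.1, 0.0, 0.2]),
    "rabbits": np.array([0.8, 0.2, 0.1, 0.1]),
    "immune": np.array([0.7, 0.3, 0.1, 0.3]),
    "cells": np.array([0.1, 0.9, 0.3, 0.0]),
    "receptor": np.array([0.0, 0.8, 0.4, 0.1]),
    "signaling": np.array([0.2, 0.7, 0.5, 0.0]),
}


def document_vector(tokens, vectors=TOY_VECTORS):
    """Mean of the vectors of the known tokens in a document."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known token: fall back to a zero vector of the right length.
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean(known, axis=0)


def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Four toy "abstracts", already pre-processed into tokens.
docs = [
    ["serum", "rabbits"],
    ["immune", "serum"],
    ["cells", "receptor"],
    ["receptor", "signaling", "cells"],
]
doc_matrix = np.vstack([document_vector(d) for d in docs])
coords = pca_2d(doc_matrix)  # one (x, y) pair per document
```

In this toy example the first two documents share vocabulary with each other, as do the last two, so they fall on opposite sides of the first principal axis. The sign of each axis is arbitrary in PCA, so only relative positions are meaningful.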
Figure 3 shows an elongated cloud of dots, where each point represents a summary/abstract and the colours indicate the publication decade. The figure clearly shows that the texts are not homogeneously distributed: older texts (e.g. purple) tend to segregate at one pole of the cloud, while the more recent ones (e.g. khaki and orange) are mostly located at the opposite pole. Since the cloud almost looks as if it had been formed by the coalescence of two smaller clouds (indicated by two circles in Fig. 3), we decided to run a k-means clustering algorithm, with k = 2. A clustering algorithm is a mathematical procedure that automatically assigns a set of items to a preset number of clusters, finding the best way to partition the data into homogeneous groups. Unsurprisingly, the algorithm split the articles at the junction of the two apparent clouds. Once the clusters were plotted by publication year, it became apparent that the first cloud contained articles mostly published before the 1970s, whereas the second cloud was mostly composed of articles published after that decade (Fig. 3).
We then proceeded to consider the linguistic surface of the texts, and several characteristics could be observed changing over time. Our data suggest that the use of ‘I’ increased in the 1970s and 1980s, only to again in articles published in the the use of has steadily increased since the The choice of the first over the also have to do with the number of of publications, in recent require because of their (Fig. and their as by was pronouns, such as even more second pronouns, are in scientific which are usually on the results obtained by the (data not shown). with a use of (Fig. a more has (Fig. and a increase in the 1970s as with the preceding To language, LIWC on specific which include words from words (e.g. (e.g. (e.g. (e.g. and (e.g. & Pennebaker, 2010). The score is then generated as the mean of of words in the dictionary.
We then the four LIWC non-transparent summary Analytical Clout, and Emotional (Fig. Unsurprisingly, all abstracts, of their publication high on the Analytical which is an associated with the use of a more and language, as it is from scientific papers, and change over time was (Fig. The Clout which is associated to the use of authoritative language, steadily increased from the with by et & 2017). articles published on JEM did not a change in the whereas the which the of from the text, peaked around the 1920s and then steadily The results with the Clout are by the data in Fig. because personal are part of the words that make up that We then the temporal in this corpus (Fig. The temporal focus of the text is by LIWC again through the use of specific dictionaries that comprise associated to the past (e.g. to the present (e.g. is, or to the (e.g. & Pennebaker, 2010). Once the score is the mean of of words in the over the number of a focus was as it may be anticipated in texts that on a set of conducted in a recent it is, however, that LIWC was to up a focus on the which peaked around the first years of the 20th only to then and and a focus on the which peaked around the and then after the The presence of abstracts is in scientific Abstracts almost a scientific and are often used as a sample to assess the of the et al., journals at the beginning of the 20th century, however, did not possess abstracts but at the end of the main text. abstracts and summaries a of the main text, they differ (Vaughan, 1991). The transition from summaries to abstracts, was more just a of a piece of text. 
It was a of the text The Journal of Experimental Medicine is a useful case study to investigate this through summaries published in JEM during the 1980s, it apparent that these texts had already evolved from what they used to be almost a The editor's move was likely to reflect a change in the of readers that had to take possibly decades, As similar changes may still be occurring in the literature, even in the of a formal of the text, we to investigate how articles had changed by the time the editorial decided to the structure of their articles We decided to investigate this change in a way through a To this we first obtained vector representations of the content of using a common NLP representations of a text to conveniently mathematical methods to texts, including the of and The semantic distribution of the content of these texts was by two (Fig. 3), which out to mostly include articles published up to the 1970s or after that These data would that the 1970s represent a sort of moment, or a so to in the of This of analysis not provide details as to what these two of texts, whether the vocabulary used, or their for research To further that, we had to into the of the text, and we decided to take of a proprietary software that, based on the use of certain assigns scores to specific of the text & Pennebaker, et al., 2014). As the focus of the present study is not interest more on the characteristics that are by LIWC on the actual that LIWC and how they can to and LIWC several that had changed over time. the use of personal in JEM abstracts changed quite in the of the 20th (Fig. A vast amount of literature has the increase in the use of pronouns, which are often a choice for the structure of these texts, in several fields et al., 2021). 
A scientific article usually an or a series of conducted by the and the use of first personal can be traced back to the first of scientific journal in the their use was already in academic publications at the beginning of the 20th & common of scientific have that first personal were to be on the are often in scientific and texts, as they to the of an thus making the sentence more by to this may even to for what is This however, is and the use of ‘I’ and is a more common as they are as to make and and thus to be over The in the use of the the may be by several including the cultural of the with that, analysis has also a in since the 1970s, with abstracts deemed more older texts (Fig. in with previous & 2017). this is consistent with similar in and not in all of can be several which from a of change in use in the to an increased need of in the life to with to make for & & 2017). This to a of also be in a as English has as the academic of and of scientific by are which may and the quality of the & et al., 2017). these are linguistic of the texts, LIWC allows for of the LIWC summary are among the main of this These include Analytical Clout, and Emotional (Fig. Clout and changes over time. The way Clout is an increase in the of can increase this which is the and that are by the As we in Fig. the of ‘I’ and has increased since the 1970s, so this may in part the increase in the Clout as these are it is to clearly assess whether a use in personal may have the change in Clout a or was appeared in the but we found that and are but not of the et al., In this because its readers are through the they are the experimental the results obtained and the that the to certain from them. We to that were and time the did they would a certain which a time This in the to their scientific The are the of how they conducted the study, for a moment, we are at their at the and we are with them through that specific in time. 
that the and which is by and without et al., 2021) The their in an they are in the paper that this is how this specific process in Our personal to the not role we are of that have been by the but the whole process of the that was to their results is not is not to the of this be to the has been described as when the a personal in the is similar to the and is to & 1991). are as more and et al., 2020), so it is that this which has been and (Pennebaker et al., is at partially a more to in the This would be also by the temporal focus of texts before the (Fig. LIWC a present and past time that in the case of the the second of the 20th The temporal focus is not only by the use of specific but also by and time expressions. This result would be consistent with the more common use of in the first of the So, quite abstracts would have from the use of which are often as a more way of but at the time to a with the reader through a more and a use of these changes in JEM, for the most around the 1970s, years before the format and around summaries changed quite in content and in the This is in with biomedical journals we in a recent publication (Galli et al., 2020), which abstracts around that decade. 
can be forth to articles changed so visibly in and a is possibly the of methods to literature searches, with the to these data suggest that the editor's move to abstracts was and it responded to a change in the way summaries were composed that was already well and had been for at a of These data also the use of in older a that could quite to new that with the readers that, the changes in to still be so to have to these complex and better investigate how literature has changed the way scientific literature is and how the changing we scientific texts are still on this important of knowledge In the present study has a series of changes in the of summaries/abstracts over time using NLP embeddings and LIWC including a in a use of personal pronouns, together with of more and while many can be observed when we consider the scientific of this journal since its at the end of the 19th century, most of the that were at the time of the transition from summary to abstract had already since the or This that although they had the name that they had in had already to a different by the 1970s, which then in the following decades and the editor's decision was actually quite as to the literature to be to a more to the thanks to the use of a and to this purpose from the use of that quite common in older by within the The of and formal analysis, data and have read and to the published of the important specific specific high length specific together data provide time new is on