The ability of Generative AI (GAI) technology to automatically check, synthesize and modify software engineering artifacts promises to revolutionize all aspects of software engineering. Using GAI for software engineering tasks is consequently one of the most rapidly expanding fields of software engineering research, with over a hundred LLM-based code models having been published since 2021. However, the overwhelming majority of existing code models share a major weakness - they are exclusively trained on the syntactic facet of software, significantly lowering their trustworthiness in tasks dependent on software semantics. To address this problem, a new class of "Morescient" GAI is needed that is "aware" of (i.e., trained on) both the semantic and static facets of software. This, in turn, will require a new generation of software observation platforms capable of generating large quantities of execution observations in a structured and readily analyzable way. In this paper, we present a vision and roadmap for how such "Morescient" GAI models can be engineered, evolved and disseminated according to the principles of open science.
Diversity with respect to ethnicity and gender has been studied in open-source and industrial settings for software development. Publication avenues such as academic conferences and journals contribute to the growing technology industry. However, there have been very few diversity-related studies conducted in the context of academia. In this paper, we study the ethnic, gender, and geographical diversity of the authors published in Software Engineering conferences and journals. We provide a systematic quantitative analysis of the diversity of publications and organizing and program committees of three top conferences and two top journals in Software Engineering, which indicates the existence of bias and entry barriers towards authors and committee members belonging to certain ethnicities, gender, and/or geographical locations in Software Engineering conferences and journal publications. For our study, we analyse publication (accepted authors) and committee data (Program and Organizing committee/ Journal Editorial Board) from the conferences ICSE, FSE, and ASE and the journals IEEE TSE and ACM TOSEM from 2010 to 2022. The analysis of the data shows that across participants and commit
Rankings of scholarly journals based on citation data are often met with skepticism by the scientific community. Part of the skepticism is due to disparity between the common perception of journals' prestige and their ranking based on citation counts. A more serious concern is the inappropriate use of journal rankings to evaluate the scientific influence of authors. This paper focuses on analysis of the table of cross-citations among a selection of Statistics journals. Data are collected from the Web of Science database published by Thomson Reuters. Our results suggest that modelling the exchange of citations between journals is useful to highlight the most prestigious journals, but also that journal citation data are characterized by considerable heterogeneity, which needs to be properly summarized. Inferential conclusions require care in order to avoid potential over-interpretation of insignificant differences between journal ratings. Comparison with published ratings of institutions from the UK's Research Assessment Exercise shows strong correlation at aggregate level between assessed research quality and journal citation `export scores' within the discipline of Statistics.
Physics-informed machine learning (PIML) is emerging as a potentially transformative paradigm for modeling complex biomedical systems by integrating parameterized physical laws with data-driven methods. Here, we review three main classes of PIML frameworks: physics-informed neural networks (PINNs), neural ordinary differential equations (NODEs), and neural operators (NOs), highlighting their growing role in biomedical science and engineering. We begin with PINNs, which embed governing equations into deep learning models and have been successfully applied to biosolid and biofluid mechanics, mechanobiology, and medical imaging among other areas. We then review NODEs, which offer continuous-time modeling, especially suited to dynamic physiological systems, pharmacokinetics, and cell signaling. Finally, we discuss deep NOs as powerful tools for learning mappings between function spaces, enabling efficient simulations across multiscale and spatially heterogeneous biological domains. Throughout, we emphasize applications where physical interpretability, data scarcity, or system complexity make conventional black-box learning insufficient. We conclude by identifying open challenges and fu
By treating data and models as the source code, Foundation Models (FMs) become a new type of software. Mirroring the concept of the software crisis, the increasing complexity of FMs makes an FM crisis a tangible concern in the coming decade, calling for new theories and methodologies from the field of software engineering. In this paper, we outline our vision of introducing Foundation Model (FM) engineering, a strategic response to the anticipated FM crisis with principled engineering methodologies. FM engineering aims to mitigate potential issues in FM development and application through the introduction of declarative, automated, and unified programming interfaces for both data and model management, reducing the complexities involved in working with FMs by providing a more structured and intuitive process for developers. Through the establishment of FM engineering, we aim to provide a robust, automated, and extensible framework that addresses the imminent challenges and uncovers new research opportunities for the software engineering field.
Using the Scopus dataset (1996-2007) a grand matrix of aggregated journal-journal citations was constructed. This matrix can be compared in terms of the network structures with the matrix contained in the Journal Citation Reports (JCR) of the Institute of Scientific Information (ISI). Since the Scopus database contains a larger number of journals and covers also the humanities, one would expect richer maps. However, the matrix is in this case sparser than in the case of the ISI data. This is due to (i) the larger number of journals covered by Scopus and (ii) the historical record of citations older than ten years contained in the ISI database. When the data is highly structured, as in the case of large journals, the maps are comparable, although one may have to vary a threshold (because of the differences in densities). In the case of interdisciplinary journals and journals in the social sciences and humanities, the new database does not add a lot to what is possible with the ISI databases.
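The density comparison and the thresholding step described above can be illustrated with a small sketch. The toy matrix, the function names `density` and `apply_threshold`, and the threshold value are all hypothetical and are not taken from the Scopus or ISI data; the sketch only shows how raising a threshold changes the share of non-empty cells in a journal-journal citation matrix.

```python
def density(matrix):
    """Share of off-diagonal cells that hold at least one citation."""
    n = len(matrix)
    nonzero = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and matrix[i][j] > 0
    )
    return nonzero / (n * (n - 1))


def apply_threshold(matrix, threshold):
    """Set every cell below the threshold to zero, keeping the rest."""
    return [[v if v >= threshold else 0 for v in row] for row in matrix]


# Hypothetical 4-journal citation matrix (rows cite columns).
toy = [
    [0, 12, 1, 0],
    [9, 0, 0, 2],
    [1, 0, 0, 30],
    [0, 3, 25, 0],
]

raw_density = density(toy)                            # 8 of 12 cells are non-zero
thresholded_density = density(apply_threshold(toy, 5))  # 4 of 12 cells remain
print(raw_density, thresholded_density)
```

With two matrices of different densities, one could vary the threshold on the denser matrix until both densities are roughly equal before comparing the resulting maps; whether that matches the procedure used in the study is an assumption.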
Domain adaptation is a widely used method in natural language processing (NLP) to improve the performance of a language model within a specific domain. This method is particularly common in the biomedical domain, which sees regular publication of numerous scientific articles. PubMed, a significant corpus of text, is frequently used in the biomedical domain. The primary objective of this study is to explore whether refining a pre-training dataset using specific quality metrics for scientific papers can enhance the performance of the resulting model. To accomplish this, we employ two straightforward journal impact metrics and conduct experiments by continually pre-training BERT on various subsets of the complete PubMed training set. We then evaluate the resulting models on biomedical language understanding tasks from the BLURB benchmark. Our results show that pruning using journal impact metrics is not efficient. However, we also show that pre-training using fewer abstracts (but with the same number of training steps) does not necessarily decrease the resulting model's performance.
Context: Citations are a key measure of scientific performance in most fields, including software engineering. However, there is limited research that studies which characteristics of articles' metadata (title, abstract, keywords, and author list) are driving citations in this field. Objective: In this study, we propose a simple theoretical model for how citations come to be with respect to article metadata, we hypothesize theoretical linkages between metadata characteristics and citations of articles, and we empirically test these hypotheses. Method: We use multiple regression analyses to examine a data set comprising the titles, abstracts, keywords, and authors of 16,131 software engineering articles published between 1990 and 2020 in 20 highly influential software engineering venues. Results: We find that number of authors, number of keywords, number of question marks and dividers in the title, number of acronyms, abstract length, abstract propositional idea density, and corresponding authors in the core Anglosphere are significantly related to citations. Conclusion: Various characteristics of articles' metadata are linked to the frequency with which the corresponding articles a
A paradigm shift is underway in Software Engineering, with AI systems such as LLMs playing an increasingly important role in boosting software development productivity. This trend is anticipated to persist. In the coming years, we expect a growing symbiotic partnership between human software developers and AI. The Software Engineering research community cannot afford to overlook this trend; we must address the key research challenges posed by the integration of AI into the software development process. In this paper, we present our vision of the future of software development in an AI-driven world and explore the key challenges that our research community should address to realize this vision.
Civic grassroots have proven their ability to create useful and scalable software that addresses pressing social needs. Although software engineering plays a fundamental role in the process of creating civic technology, academic literature that analyses the software development processes of civic tech grassroots is scarce. This paper aims to advance the understanding of how civic grassroots tackle the different activities in their software development processes. In this study, we followed the formation of two projects in a civic tech group (Code for Ireland), seeking to understand how their development processes evolved over time and how the group carried out its work in creating new technology. Our preliminary findings show that such groups are capable of setting up systematic software engineering processes that address software specification, development, validation, and evolution. While the group was able to deliver software according to self-specified quality standards, it faced challenges in requirements specification, stakeholder engagement, and reorienting from development to product delivery. Software engineering methods and tools can effectively support the future of ci
A number of journal classification systems have been developed in bibliometrics since the launch of the Citation Indices by the Institute of Scientific Information (ISI) in the 1960s. These systems are used to normalize citation counts with respect to field-specific citation patterns. The best known system is the so-called "Web-of-Science Subject Categories" (WCs). In other systems papers are classified by algorithmic solutions. Using the Journal Citation Reports 2014 of the Science Citation Index and the Social Science Citation Index (n of journals = 11,149), we examine options for developing a new system based on journal classifications into subject categories using aggregated journal-journal citation data. Combining routines in VOSviewer and Pajek, a tree-like classification is developed. At each level one can generate a map of science for all the journals subsumed under a category. Nine major fields are distinguished at the top level. Further decomposition of the social sciences is pursued for the sake of example with a focus on journals in information science (LIS) and science studies (STS). The new classification system improves on alternative options by avoiding the problem
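To make the idea of normalizing citation counts against field-specific citation patterns concrete, here is a minimal sketch. It assumes the simplest common variant, dividing each paper's citations by the mean citation count of its field; the abstract does not say which normalization is used, and the field labels and counts below are invented for illustration.

```python
from collections import defaultdict


def field_normalized_scores(papers):
    """Divide each paper's citation count by the mean of its field.

    `papers` is a list of (field, citations) pairs. A score of 1.0 means
    the paper is cited exactly as often as the average paper in its field.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for field, citations in papers:
        totals[field] += citations
        counts[field] += 1

    means = {field: totals[field] / counts[field] for field in totals}
    return [
        citations / means[field] if means[field] > 0 else 0.0
        for field, citations in papers
    ]


# Hypothetical data: biology papers are cited ten times as often as math papers.
papers = [("math", 2), ("math", 4), ("biology", 20), ("biology", 40)]
scores = field_normalized_scores(papers)
print(scores)
```

After normalization the first math paper and the first biology paper receive the same score even though their raw counts differ by a factor of ten, which is why the assignment of journals to categories matters for the outcome.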
For over twenty years, the Software Engineering (SE) research community has been involved with Evidence-Based Software Engineering (EBSE). EBSE aims to inform industrial practice with the best evidence from rigorous research, preferably from systematic literature reviews (SLRs). Since then, SE researchers have conducted many SLRs, perfected their SLR procedures, proposed alternative ways of presenting their results (such as Evidence Briefings), and extensively discussed how to conduct research that impacts practice. Nevertheless, there is still a feeling that SLRs' results are not reaching practitioners. Something is missing. In this vision paper, we introduce Evidence to Decision (EtD) frameworks from the health sciences, which propose gathering experts in panels to assess the existing best evidence about the impact of an intervention in all relevant outcomes and make structured recommendations based on them. The insight we can leverage from EtD frameworks is not their structure per se but all the relevant criteria for making recommendations to practitioners from SLRs. Furthermore, we provide a worked example based on an SE SLR. We also discuss the challenges the SE research and pr
The Pioneer 10/11 spacecraft yielded the most precise navigation in deep space to date. However, their radio-metric tracking data, received from distances of 20--70 astronomical units from the Sun, have consistently indicated the presence of a small, anomalous Doppler frequency drift. The drift is a blue frequency shift that can be interpreted as a sunward acceleration of a_P = (8.74 +/- 1.33) x 10^{-10} m/s^2 for each particular spacecraft. This signal has become known as the Pioneer anomaly; the nature of this anomaly remains unexplained. Recently, new Pioneer 10 and 11 radio-metric Doppler and flight telemetry data became available. The newly available Doppler data set is significantly enlarged when compared to the data used in previous investigations and is expected to be the primary source for the investigation of the anomaly. In addition, the flight telemetry files, original project documentation, and newly developed software tools are now used to reconstruct the engineering history of both spacecraft. With the help of this information, a thermal model of the Pioneer vehicles is being developed to study the possible contribution of the thermal recoil force acting on the two spa
Automatic analysis of biomedical time series such as electroencephalogram (EEG) and electrocardiographic (ECG) signals has attracted great interest in the biomedical engineering community due to its important applications in medicine. In this work, a simple yet effective bag-of-words representation that is able to capture both local and global structure similarity information is proposed for biomedical time series representation. In particular, similar to the bag-of-words model used in the text document domain, the proposed method treats a time series as a text document and extracts local segments from the time series as words. The biomedical time series is then represented as a histogram of codewords, each entry of which is the number of times a codeword appears in the time series. Although the temporal order of the local segments is ignored, the bag-of-words representation is able to capture high-level structural information because both local and global structural information are well utilized. The performance of the bag-of-words model is validated on three datasets extracted from real EEG and ECG signals. The experimental results demonstrate that the proposed method is not only insens
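The representation described above (local segments as words, a histogram of codewords as the final feature) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the codebook is fixed by hand rather than learned (in practice it would typically come from clustering training segments), segments are matched to codewords by squared Euclidean distance, and all names and values are hypothetical.

```python
def extract_segments(series, window, step=1):
    """Slide a window over the series and return the local segments."""
    return [
        series[i:i + window]
        for i in range(0, len(series) - window + 1, step)
    ]


def nearest_codeword(segment, codebook):
    """Index of the codeword closest to the segment (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(segment, codebook[k]))


def bag_of_words(series, codebook, window, step=1):
    """Histogram counting how often each codeword occurs in the series."""
    histogram = [0] * len(codebook)
    for segment in extract_segments(series, window, step):
        histogram[nearest_codeword(segment, codebook)] += 1
    return histogram


# Hypothetical hand-picked codebook: flat-low, spike, flat-high.
codebook = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

signal = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
reordered = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0]  # same blocks, different order

hist_signal = bag_of_words(signal, codebook, window=3, step=3)
hist_reordered = bag_of_words(reordered, codebook, window=3, step=3)
print(hist_signal, hist_reordered)  # both are [1, 1, 2]
```

The two signals contain the same three-sample blocks in a different order and produce identical histograms, which illustrates that the temporal order of the local segments is ignored.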
Realisation of significant advances in capabilities of sensors, computing, timing, and communication enabled by quantum technologies is dependent on engineering highly complex systems that integrate quantum devices into existing classical infrastructure. A systems engineering approach is considered to address the growing need for quantum-secure telecommunications that overcome the threat to encryption caused by maturing quantum computation. This work explores a range of existing and future quantum communication networks, specifically quantum key distribution network proposals, to model and demonstrate the evolution of quantum key distribution network architectures. Leveraging Orthogonal Variability Modelling and Systems Modelling Language as candidate modelling languages, the study creates traceable artefacts to promote modular architectures that are reusable for future studies. We propose a variability-driven framework for managing fast-evolving network architectures with respect to increasing stakeholder expectations. The result contributes to the systematic development of viable quantum key distribution networks and supports the investigation of similar integration challenges re
Silicon photonics has been studied as an integratable optical platform where numerous applicable devices and systems are created based on modern physics and state-of-the-art nanotechnologies. The implementation of quantum mechanics has been the driving force of the most intriguing designs of photonic structures, since optical systems have shown great capability and potential in realizing analogues of quantum concepts and phenomena. Non-Hermitian physics, which breaks the conventional scope of quantum mechanics based on Hermitian Hamiltonians, has been widely explored in the platform of silicon photonics, with promising designs of optical refractive index, modal coupling and gain-loss distribution. As we will discuss in this chapter, the unconventional properties of exceptional points and parity-time symmetry realized in silicon photonics have created new opportunities for ultrasensitive sensors, laser engineering, control of light propagation, topological mode conversion, etc. The marriage between quantum non-Hermiticity and classical silicon platforms not only spurs numerous studies on the fundamental physics, but also enriches the potential functionalities of the integ
Physics-informed machine learning models have recently emerged with some interesting and unique features that can be applied to reservoir engineering. In particular, physics-informed neural networks (PINNs) leverage the fact that neural networks are universal function approximators that can embed in the learning process the knowledge of any physical laws that govern a given data set and can be described by partial differential equations. The transient diffusivity equation is a fundamental equation in reservoir engineering, and the general solution to this equation forms the basis for Pressure Transient Analysis (PTA). The diffusivity equation is derived by combining three physical principles: the continuity equation, Darcy's equation, and the equation of state for a slightly compressible liquid. Obtaining general solutions to this equation is imperative to understand flow regimes in porous media. Analytical solutions of the transient diffusivity equation are usually hard to obtain due to the stiff nature of the equation caused by the steep gradients of the pressure near the well. In this work we apply physics-informed neural networks to the one- and two-dimensional diffusiv
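The way a governing equation enters the learning process can be illustrated with the physics term of a PINN loss, which penalizes the PDE residual at a set of collocation points. The sketch below assumes the standard one-dimensional linear diffusivity equation, dp/dt = eta * d2p/dx2, with a made-up diffusivity `ETA`, and it replaces the neural network by plain Python functions. Derivatives are taken by central finite differences here, whereas a PINN would normally obtain them by automatic differentiation of the network; all names are hypothetical.

```python
import math

ETA = 0.5  # hypothetical hydraulic diffusivity (assumed constant)


def exact_pressure(x, t, k=1.0):
    """A known solution of dp/dt = ETA * d2p/dx2 (a decaying sine mode)."""
    return math.exp(-ETA * k * k * t) * math.sin(k * x)


def wrong_pressure(x, t):
    """A candidate that does not satisfy the equation."""
    return math.sin(x) * (1.0 - t)


def pde_residual(p, x, t, h=1e-3):
    """Residual dp/dt - ETA * d2p/dx2 using central finite differences."""
    dp_dt = (p(x, t + h) - p(x, t - h)) / (2 * h)
    d2p_dx2 = (p(x + h, t) - 2 * p(x, t) + p(x - h, t)) / (h * h)
    return dp_dt - ETA * d2p_dx2


def physics_loss(p, points):
    """Mean squared PDE residual over the collocation points."""
    return sum(pde_residual(p, x, t) ** 2 for x, t in points) / len(points)


# Collocation points on a small grid in (x, t).
points = [(0.1 * i, 0.1 * j) for i in range(1, 10) for j in range(1, 10)]

loss_exact = physics_loss(exact_pressure, points)
loss_wrong = physics_loss(wrong_pressure, points)
print(loss_exact, loss_wrong)
```

The loss is close to zero for the function that satisfies the equation and clearly positive for the one that does not. In a full PINN this term would be added to data, boundary, and initial-condition losses, and the network weights would be trained to minimize the sum.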
Recruiting and retaining highly qualified physics and physical science teachers is critical for maintaining America's global competitiveness. Unfortunately, only one third of the high school teachers in physics have a degree in physics and an even smaller number of physical science teachers in middle school have a good grasp of the scientific content they teach. Moreover, teachers often lack adequate pedagogical content knowledge to teach science effectively. Here, we discuss the development, implementation, and assessment of a course for science and engineering undergraduates designed to increase awareness and help them develop an interest and a deeper appreciation of the intellectual demands of physics teaching. The course focused on increasing student enthusiasm and confidence in teaching by providing well supported teaching opportunities and exposure to physics education research. The course assessment methods include 1) pre/post-test measures of attitude and expectations about science teaching, 2) self and peer evaluation of student teaching, 3) content-based pre/post-tests given to students who received instruction from the student teachers, and 4) audio-taped focus group dis
Machine learning (ML) components are being added to more and more critical and impactful software systems, but the software development process of real-world production systems from prototyped ML models remains challenging with additional complexity and interdisciplinary collaboration challenges. This poses difficulties in using traditional software lifecycle models such as waterfall, spiral, or agile models when building ML-enabled systems. In this research, we apply a Systems Engineering lens to investigate the use of V-Model in addressing the interdisciplinary collaboration challenges when building ML-enabled systems. By interviewing practitioners from software companies, we established a set of 8 propositions for using V-Model to manage interdisciplinary collaborations when building products with ML components. Based on the propositions, we found that despite requiring additional efforts, the characteristics of V-Model align effectively with several collaboration challenges encountered by practitioners when building ML-enabled systems. We recommend future research to investigate new process models, frameworks and tools that leverage the characteristics of V-Model such as the sy
One source of software project challenges and failures is the systematic errors introduced by human cognitive biases. Although extensively explored in cognitive psychology, investigations concerning cognitive biases have only recently gained popularity in software engineering (SE) research. This paper therefore systematically maps, aggregates and synthesizes the literature on cognitive biases in software engineering to generate a comprehensive body of knowledge, understand state-of-the-art research and provide guidelines for future research and practice. Focusing on bias antecedents, effects and mitigation techniques, we identified 65 articles, which investigate 37 cognitive biases, published between 1990 and 2016. Despite strong and increasing interest, the results reveal a scarcity of research on mitigation techniques and poor theoretical foundations in understanding and interpreting cognitive biases. Although bias-related research has generated many new insights in the software engineering community, specific bias mitigation techniques are still needed for software professionals to overcome the deleterious effects of cognitive biases on their work.