We compare the network of aggregated journal-journal citation relations provided by the Journal Citation Reports (JCR) 2012 of the Science and Social Science Citation Indexes (SCI and SSCI) with similar data based on Scopus 2012. First, global maps were developed for the two sets separately; sets of documents can then be compared using overlays to both maps. Using fuzzy-string matching and ISSN numbers, we were able to match 10,524 journal names between the two sets; that is, 96.4% of the 10,936 journals contained in JCR or 51.2% of the 20,554 journals covered by Scopus. Network analysis was then pursued on the set of journals shared between the two databases and the two sets of unique journals. Citations among the shared journals are more comprehensively covered in JCR than in Scopus, so the network in JCR is denser and more connected than in Scopus. The ranking of shared journals in terms of indegree (that is, numbers of citing journals) or total citations is similar in both databases overall (Spearman's ρ > 0.97), but some individual journals rank very differently. Journals that are unique to Scopus seem to be less important--they are citing shared journals rather than bein
Rankings of scholarly journals based on citation data are often met with skepticism by the scientific community. Part of the skepticism is due to disparity between the common perception of journals' prestige and their ranking based on citation counts. A more serious concern is the inappropriate use of journal rankings to evaluate the scientific influence of authors. This paper focuses on analysis of the table of cross-citations among a selection of Statistics journals. Data are collected from the Web of Science database published by Thomson Reuters. Our results suggest that modelling the exchange of citations between journals is useful to highlight the most prestigious journals, but also that journal citation data are characterized by considerable heterogeneity, which needs to be properly summarized. Inferential conclusions require care in order to avoid potential over-interpretation of insignificant differences between journal ratings. Comparison with published ratings of institutions from the UK's Research Assessment Exercise shows strong correlation at aggregate level between assessed research quality and journal citation 'export scores' within the discipline of Statistics.
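As a rough illustration of what modelling the exchange of citations between journals can look like, the sketch below fits a Bradley-Terry-type paired-comparison model to a toy cross-citation table. The abstract does not name its model, so the choice of Bradley-Terry, the function name `export_scores`, the fixed iteration count, and the toy counts are all assumptions made for this example.

```python
import math


def export_scores(cited_by, iterations=200):
    """Fit a Bradley-Terry-type model with a simple iterative update.

    cited_by[i][j] is the number of citations journal i receives from
    journal j (hypothetical layout). Returns centred log-abilities,
    used here as illustrative 'export scores'.
    Assumes every journal is cited at least once by some other journal.
    """
    n = len(cited_by)
    p = [1.0] * n
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            # Citations received by i from all other journals.
            received = sum(cited_by[i][j] for j in range(n) if j != i)
            # Total exchange with each partner, weighted by current abilities.
            denom = sum(
                (cited_by[i][j] + cited_by[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(received / denom)
        total = sum(new_p)
        p = [v * n / total for v in new_p]
    logs = [math.log(v) for v in p]
    mean = sum(logs) / n
    return [v - mean for v in logs]


# Toy table for three hypothetical journals.
toy = [
    [0, 30, 40],
    [10, 0, 25],
    [5, 15, 0],
]
scores = export_scores(toy)
print(scores)
```

A higher score means the journal receives more citations from its partners than it sends to them; small gaps between scores should not be read as meaningful without uncertainty estimates, which this sketch does not compute.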
This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of context type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world's 7,000+ langu
Using the Scopus dataset (1996-2007), a grand matrix of aggregated journal-journal citations was constructed. This matrix can be compared in terms of the network structures with the matrix contained in the Journal Citation Reports (JCR) of the Institute for Scientific Information (ISI). Since the Scopus database contains a larger number of journals and also covers the humanities, one would expect richer maps. However, the matrix is in this case sparser than in the case of the ISI data. This is due to (i) the larger number of journals covered by Scopus and (ii) the historical record of citations older than ten years contained in the ISI database. When the data is highly structured, as in the case of large journals, the maps are comparable, although one may have to vary a threshold (because of the differences in densities). In the case of interdisciplinary journals and journals in the social sciences and humanities, the new database does not add a lot to what is possible with the ISI databases.
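The notions of matrix density and of varying a threshold can be made concrete with a small sketch. The helper names `density` and `apply_threshold`, the toy matrix, and the threshold of 5 are hypothetical choices for illustration and are not taken from the study.

```python
def density(matrix):
    """Share of off-diagonal cells that contain at least one citation."""
    n = len(matrix)
    if n < 2:
        return 0.0
    filled = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and matrix[i][j] > 0
    )
    return filled / (n * (n - 1))


def apply_threshold(matrix, threshold):
    """Zero out every cell with fewer than `threshold` citations."""
    return [[v if v >= threshold else 0 for v in row] for row in matrix]


# Toy citing-by-cited matrix for four hypothetical journals.
toy = [
    [0, 12, 1, 0],
    [9, 0, 0, 2],
    [1, 0, 0, 30],
    [0, 3, 25, 0],
]
print(density(toy))                      # density of the raw matrix
print(density(apply_threshold(toy, 5)))  # density after dropping weak links
```

Raising the threshold removes weak links first, which is one way to make two matrices of different densities easier to compare visually.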
The advantages of neural machine translation (NMT) have been extensively validated for offline translation of several language pairs for different domains of spoken and written language. However, research on interactive learning of NMT by adaptation to human post-edits has so far been confined to simulation experiments. We present the first user study on online adaptation of NMT to user post-edits in the domain of patent translation. Our study involves 29 human subjects (translation students) whose post-editing effort and translation quality were measured on about 4,500 interactions of a human post-editor and a machine translation system integrating an online adaptive learning algorithm. Our experimental results show a significant reduction of human post-editing effort due to online adaptation in NMT according to several evaluation metrics, including hTER, hBLEU, and KSMR. Furthermore, we found significant improvements in BLEU/TER between NMT outputs and professional translations in granted patents, providing further evidence for the advantages of online adaptive NMT in an interactive setup.
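To give a sense of how post-editing effort can be quantified, the following sketch computes a simplified hTER-style score: the word-level edit distance between a machine translation and its human post-edit, divided by the length of the post-edit. Full TER also counts block shifts, which this sketch omits; the function names and the example sentences are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, start=1):
        curr = [i]
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_hter(mt_output, post_edit):
    """Edits needed to turn the MT output into the post-edit, per post-edit word."""
    ref_len = len(post_edit.split())
    if ref_len == 0:
        raise ValueError("post_edit must contain at least one word")
    return word_edit_distance(mt_output, post_edit) / ref_len


print(simple_hter("the device comprise a valve",
                  "the device comprises a valve"))
```

A lower value indicates that the post-editor changed less of the machine output.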
A number of journal classification systems have been developed in bibliometrics since the launch of the Citation Indices by the Institute for Scientific Information (ISI) in the 1960s. These systems are used to normalize citation counts with respect to field-specific citation patterns. The best known system is the so-called "Web-of-Science Subject Categories" (WCs). In other systems, papers are classified by algorithmic solutions. Using the Journal Citation Reports 2014 of the Science Citation Index and the Social Science Citation Index (n of journals = 11,149), we examine options for developing a new system based on journal classifications into subject categories using aggregated journal-journal citation data. Combining routines in VOSviewer and Pajek, a tree-like classification is developed. At each level one can generate a map of science for all the journals subsumed under a category. Nine major fields are distinguished at the top level. Further decomposition of the social sciences is pursued for the sake of example with a focus on journals in information science (LIS) and science studies (STS). The new classification system improves on alternative options by avoiding the problem
The field of unsupervised machine translation has seen significant advancement from the marriage of the Transformer and the back-translation algorithm. The Transformer is a powerful generative model, and back-translation leverages the Transformer's high-quality translations for iterative self-improvement. However, the Transformer is encumbered by the run-time of autoregressive inference during back-translation, and back-translation is limited by a lack of synthetic data efficiency. We propose a two-for-one improvement to Transformer back-translation: Quick Back-Translation (QBT). QBT re-purposes the encoder as a generative model, and uses encoder-generated sequences to train the decoder in conjunction with the original autoregressive back-translation step, improving data throughput and utilization. Experiments on various WMT benchmarks demonstrate that a relatively small number of refining steps of QBT improves current unsupervised machine translation models, and that QBT dramatically outperforms the standard back-translation-only method in terms of training efficiency for comparable translation quality.
Using three years of the Journal Citation Reports (2011, 2012, and 2013), indicators of transitions in 2012 (between 2011 and 2013) are studied using methodologies based on entropy statistics. Changes can be indicated at the level of journals using the margin totals of entropy production along the row or column vectors, but also at the level of links among journals by importing the transition matrices into network analysis and visualization programs (and using community-finding algorithms). Seventy-four journals are flagged in terms of discontinuous changes in their citations; but 3,114 journals are involved in "hot" links. Most of these links are embedded in a main component; 78 clusters (containing 172 journals) are flagged as potential "hot spots" emerging at the network level. An additional finding is that PLoS ONE introduced a new communication dynamics into the database. The limitations of the methodology are elaborated using an example. The results of the study indicate where developments in the citation dynamics can be considered as significantly unexpected. This can be used as heuristic information; but what a "hot spot" in terms of the entropy statistics of aggregated cit
Using "Analyze Results" at the Web of Science, one can directly generate overlays onto global journal maps of science. The maps are based on the 10,000+ journals contained in the Journal Citation Reports (JCR) of the Science and Social Science Citation Indices (2011). The disciplinary diversity of the retrieval is measured in terms of Rao-Stirling's "quadratic entropy." Since this indicator of interdisciplinarity is normalized between zero and one, the interdisciplinarity can be compared among document sets and across years, cited or citing. The colors used for the overlays are based on Blondel et al.'s (2008) community-finding algorithms operating on the relations among journals included in the JCR. The results can be exported from VOSviewer with different options such as proportional labels, heat maps, or cluster density maps. The maps can also be web-started and/or animated (e.g., using PowerPoint). The "citing" dimension of the aggregated journal-journal citation matrix was found to provide a more comprehensive description than the matrix based on the cited archive. The relations between local and global maps and their different functions in studying the sciences in terms of journal lit
Text production (and translation) proceeds in the form of stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann, 2016), this paper compares five approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.
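The basic idea of separating stretches of typing by a pause threshold can be sketched as follows. This is a fixed-threshold illustration only: it is not one of the five approaches compared in the paper, nor the novel method it proposes, and the function name, the 1000 ms default, and the sample timestamps are assumptions.

```python
def split_production_units(timestamps_ms, threshold_ms=1000):
    """Group keystroke timestamps into units separated by long pauses.

    A new unit starts whenever the gap to the previous keystroke
    exceeds `threshold_ms` (a hypothetical fixed threshold).
    """
    units = []
    current = []
    for t in timestamps_ms:
        if current and t - current[-1] > threshold_ms:
            units.append(current)
            current = []
        current.append(t)
    if current:
        units.append(current)
    return units


# Hypothetical keystroke times in milliseconds.
keystrokes = [0, 150, 300, 2000, 2100, 5000]
print(split_production_units(keystrokes))
```

Changing `threshold_ms` changes how many units are found, which is why the choice of threshold matters for any downstream analysis.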
Publication patterns of 79 forest scientists awarded major international forestry prizes during 1990-2010 were compared with the journal classification and ranking promoted as part of the 'Excellence in Research for Australia' (ERA) by the Australian Research Council. The data revealed that these scientists exhibited an elite publication performance during the decade before and two decades following their first major award. An analysis of their 1703 articles in 431 journals revealed substantial differences between the journal choices of these elite scientists and the ERA classification and ranking of journals. Implications from these findings are that additional cross-classifications should be added for many journals, and there should be an adjustment to the ranking of several journals relevant to the ERA Field of Research classified as 0705 Forestry Sciences.
The advent of large language models (LLMs) has significantly advanced the field of code translation, enabling automated translation between programming languages. However, these models often struggle with complex translation tasks due to inadequate contextual understanding. This paper introduces a novel approach that enhances code translation through Few-Shot Learning, augmented with retrieval-based techniques. By leveraging a repository of existing code translations, we dynamically retrieve the most relevant examples to guide the model in translating new code segments. Our method, based on Retrieval-Augmented Generation (RAG), substantially improves translation quality by providing contextual examples from which the model can learn in real-time. We selected RAG over traditional fine-tuning methods due to its ability to utilize existing codebases or a locally stored corpus of code, which allows for dynamic adaptation to diverse translation tasks without extensive retraining. Extensive experiments on diverse datasets with open LLM models such as Starcoder, Llama3-70B Instruct, CodeLlama-34B Instruct, Granite-34B Code Instruct, and Mixtral-8x22B, as well as commercial LLM models like
We introduce a novel methodology for mapping academic institutions based on their journal publication profiles. We believe that journals in which researchers from academic institutions publish their works can be considered as useful identifiers for representing the relationships between these institutions and establishing comparisons. However, when academic journals are used for research output representation, distinctions must be introduced between them, based on their value as institution descriptors. This leads us to the use of journal weights attached to the institution identifiers. Since a journal in which researchers from a large proportion of institutions published their papers may be a bad indicator of similarity between two academic institutions, it seems reasonable to weight it in accordance with how frequently researchers from different institutions published their papers in this journal. Cluster analysis can then be applied to group the academic institutions, and dendrograms can be provided to illustrate groups of institutions following agglomerative hierarchical clustering. In order to test this methodology, we use a sample of Spanish universities as a case study. We f
Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks. Neural machine translation models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation. In this paper, we focus on analyzing the properties of neural machine translation using two models: RNN Encoder--Decoder and a newly proposed gated recursive convolutional neural network. We show that neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase. Furthermore, we find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically.
The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly-tuned single neural network that maximizes translation performance, a very different approach from traditional statistical machine translation. Recently proposed neural machine translation models often belong to the encoder-decoder family in which a source sentence is encoded into a fixed length vector that is, in turn, decoded to generate a translation. The present research examines the effects of different training methods on a Polish-English Machine Translation system used for medical data. The European Medicines Agency parallel text corpus was used as the basis for training of neural and statistical network-based translation systems. The main machine translation evaluation metrics have also been used in analysis of the systems. A comparison and implementation of a real-time medical translator is the main focus of our experiments.
The recent advances introduced by neural machine translation (NMT) are rapidly expanding the application fields of machine translation, as well as reshaping the quality level to be targeted. In particular, if translations have to fit some given layout, quality should not only be measured in terms of adequacy and fluency, but also length. Exemplary cases are the translation of document files, subtitles, and scripts for dubbing, where the output length should ideally be as close as possible to the length of the input text. This paper addresses for the first time, to the best of our knowledge, the problem of controlling the output length in NMT. We investigate two methods for biasing the output length with a transformer architecture: i) conditioning the output to a given target-source length-ratio class and ii) enriching the transformer positional embedding with length information. Our experiments show that both methods can induce the network to generate shorter translations, as well as to acquire interpretable linguistic skills.
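The first method, conditioning on a target-source length-ratio class, can be illustrated with a small data-preparation sketch. The class names, the character-based ratio, the 0.95 and 1.05 boundaries, and the tag format are assumptions made for this example rather than details given in the abstract.

```python
def length_class(source, target, low=0.95, high=1.05):
    """Assign a training pair to a length-ratio class.

    The ratio is target length over source length, measured in
    characters (an assumption for this sketch).
    """
    if not source:
        raise ValueError("source must be non-empty")
    ratio = len(target) / len(source)
    if ratio < low:
        return "short"
    if ratio > high:
        return "long"
    return "normal"


def tag_source(source, target):
    """Prepend the class as a token so a model can condition on it."""
    return f"<{length_class(source, target)}> {source}"


print(tag_source("hello world", "hola"))  # prints "<short> hello world"
```

At inference time the desired class token would be chosen by the user instead of being derived from a reference translation.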
Dyads of journals related by citations can agglomerate into specialties through the mechanism of triadic closure. Using the Journal Citation Reports 2011, 2012, and 2013, we analyze triad formation as indicators of integration (specialty growth) and disintegration (restructuring). The strongest integration is found among the large journals that report on studies in different scientific specialties, such as PLoS ONE, Nature Communications, Nature, and Science. This tendency towards large-scale integration has not yet stabilized. Using the Islands algorithm, we also distinguish 51 local maxima of integration. We zoom into the cited articles that carry the integration for: (i) a new development within high-energy physics and (ii) an emerging interface between the journals Applied Mathematical Modeling and the International Journal of Advanced Manufacturing Technology. In the first case, integration is brought about by a specific communication reaching across specialty boundaries, whereas in the second, the dyad of journals indicates an emerging interface between specialties. These results suggest that integration picks up substantive developments at the specialty level. An advantage o
Predatory journals are Open Access journals of highly questionable scientific quality. Such journals pretend to use peer review for quality assurance, and spam academics with requests for submissions, in order to collect author payments. In recent years predatory journals have received a lot of negative media attention. While much has been said about the harm that such journals cause to academic publishing in general, an overlooked aspect is how much articles in such journals are actually read and in particular cited, that is, whether they have any significant impact on the research in their fields. Other studies have already demonstrated that only some of the articles in predatory journals contain faulty and directly harmful results, while a lot of the articles present mediocre and poorly reported studies. We studied citation statistics over a five-year period in Google Scholar for 250 random articles published in such journals in 2014, and found an average of 2.6 citations per article and that 60% of the articles had no citations at all. For comparison a random sample of articles published in the approximately 25,000 peer reviewed journals included in the Scopus index had an average of 18.1 cit
Over the past four decades, efforts have been made to develop and evaluate models for Empirical Translation Process Research (TPR), yet a comprehensive framework remains elusive. This article traces the evolution of empirical TPR within the CRITT TPR-DB tradition and proposes the Free Energy Principle (FEP) and Active Inference (AIF) as a framework for modeling deeply embedded translation processes. It introduces novel approaches for quantifying fundamental concepts of Relevance Theory (relevance, s-mode, i-mode), and establishes their relation to the Monitor Model, framing relevance maximization as a special case of minimizing free energy. FEP/AIF provides a mathematically rigorous foundation that enables modeling of deep temporal architectures in which embedded translation processes unfold on different timelines. This framework opens up prospects for future research in predictive TPR, which is likely to enrich our comprehension of human translation processes and to make valuable contributions to the wider realm of translation studies and the design of cognitive architectures.
This article presents the results of a study involving the reception of a fictional story by Kurt Vonnegut translated from English into Catalan and Dutch in three conditions: machine-translated (MT), post-edited (PE) and translated from scratch (HT). A total of 223 participants were recruited who rated the reading conditions using three scales: Narrative Engagement, Enjoyment and Translation Reception. The results show that HT yielded higher engagement, enjoyment and translation reception in Catalan compared to PE and MT. However, the Dutch readers show higher scores in PE than in both HT and MT, and the highest engagement and enjoyment scores are reported when reading the original English version. We hypothesize that when reading a fictional story in translation, not only the condition and the quality of the translations are key to understanding its reception, but also the participants' reading patterns, reading language, and, perhaps, language status in their own societies.