Found 20 results
We introduce Daisy, a new benchmark for Danish culture via cultural heritage, based on the curated topics from the Danish Culture Canon 2006. For each artifact in the culture canon, we query the corresponding Wikipedia page and have a language model generate random questions. This yields a sample of questions within each work that mixes central and peripheral questions, so that the benchmark covers not only mainstream information but also the in-depth cornerstones that define the heritage of Danish culture, as defined by the Canon committee. Each question-answer pair is approved or corrected by a human, and the final dataset consists of 741 closed-ended question-answer pairs covering topics that range from archaeological findings dating to 1300 BC and poems and musical pieces from the 1700s to contemporary pop music and Danish design and architecture.
Voting Advice Applications (VAAs) are tools designed to help voters compare political candidates on policy preferences prior to elections. VAAs are popular in European countries and in other countries with multi-party democratic systems. Through a freedom of information request, we obtained access to the inner workings of a popular Danish VAA called the Kandidattest, which is implemented by a major Danish news outlet and has been used for general, municipal, and European elections. Users and politicians from every political party answer the same online questionnaire and are matched based on the agreement percentage derived from their answers. VAAs play a significant role in elections, with 45% of surveyed voters reporting that they followed its recommendations in the past Danish general election; however, the inner workings of VAAs have not been thoroughly evaluated. We find that the algorithm is not robust enough for users to trust the agreement percentages in the output, as small changes to the algorithm can lead to different results, potentially affecting election results. We conduct an algorithmic audit of the Kandidattest's robustness, using simulated responses to investigate the tool'
The best-performing transformer-based language models use subword tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we believe is fundamental for understanding language-specific word structure. In this study, we leverage an annotated Danish morphological dataset to train a semi-supervised model for morphological segmentation, enabling the development of tokenizers optimized for Danish morphology. We evaluate four distinct tokenizers, including two custom morphological tokenizers, by analyzing their performance in morphologically segmenting Danish words. Additionally, we train two generative transformer models, CerebrasGPT-111M and LLaMA-3.2 1B, using these tokenizers and evaluate their downstream performance. Our findings reveal that our custom-developed tokenizers substantially enhance morphological segmentation, achieving an F1 score of 58.84, compared to 39.28 achieved by a Danish BPE tokenizer. In downstream tasks, models trained with our morphological tokenizers outperform those using BPE tokenizers across different evaluation metrics. These resul
On July 14th, 2022, the Danish Data Protection Authority issued a reprimand against Helsingor Municipality. It imposed a general ban on using Google Chromebooks and Google Workspace for education in primary schools in the Municipality. The Danish DPA banned such processing and suspended any related data transfers to the United States (U.S.) until the processing is brought in line with the General Data Protection Regulation (GDPR). The suspension took effect immediately, and the Municipality had until August 3rd, 2022, to withdraw and terminate the processing, as well as delete data already transferred. Finally, in a new decision on August 18th, 2022, the Danish DPA ratified the ban on the use of Google Chromebooks and Workspace. In the eyes of the Danish DPA, the Municipality failed, for example, to document that it had assessed and reduced the relevant risks to the rights and freedoms of the pupils. This article is structured as follows: Section II provides the background concerning the unfolding events after the Schrems II ruling. Section III discusses the origins and facts of the Danish DPA case. Section IV examines the reasoning and critical findings of the Danish DPA decision. Finall
While multiple emotional speech corpora exist for commonly spoken languages, there is a lack of functional datasets for smaller (spoken) languages, such as Danish. To our knowledge, Danish Emotional Speech (DES), published in 1997, is the only other database of Danish emotional speech. We present EmoTale, a corpus comprising Danish and English speech recordings with their associated enacted emotion annotations. We demonstrate the validity of the dataset by investigating and presenting its predictive power using speech emotion recognition (SER) models. We develop SER models for EmoTale and the reference datasets using self-supervised speech model (SSLM) embeddings and the openSMILE feature extractor. We find the embeddings superior to the hand-crafted features. The best model achieves an unweighted average recall (UAR) of 64.1% on the EmoTale corpus using leave-one-speaker-out cross-validation, comparable to the performance on DES.
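The two evaluation terms in the abstract above, unweighted average recall and leave-one-speaker-out cross-validation, can be illustrated with a minimal sketch. It is not the authors' evaluation code; the function names and the toy labels are assumptions made for illustration only.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally
    regardless of how many samples it has."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def leave_one_speaker_out(speakers):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per
    speaker, so no speaker appears in both the training and the test split."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


# Toy labels (hypothetical): three "angry" samples and one "happy" sample.
y_true = ["angry", "angry", "angry", "happy"]
y_pred = ["angry", "angry", "happy", "happy"]
uar = unweighted_average_recall(y_true, y_pred)  # (2/3 + 1/1) / 2
folds = list(leave_one_speaker_out(["a", "b", "a", "c"]))
```

On imbalanced emotion classes, UAR differs from plain accuracy: in the toy labels the accuracy is 3/4, while the UAR is 5/6 because the single "happy" sample weighs as much as the three "angry" ones.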
We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, which makes it possible to better distinguish well-performing models from low-performing ones.
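To make the idea of a corruption function concrete, here is a minimal sketch of one possible corruption, swapping two adjacent words. It is a hypothetical illustration and not one of the fourteen functions from the paper; the function names and the example sentence are assumptions.

```python
def swap_adjacent_words(sentence, index):
    """Return a copy of the sentence with the words at index and index + 1
    swapped. Hypothetical corruption, used here for illustration only."""
    words = sentence.split()
    if not 0 <= index < len(words) - 1:
        raise ValueError("index must point at a word that has a right neighbour")
    words[index], words[index + 1] = words[index + 1], words[index]
    return " ".join(words)


def make_acceptability_pair(sentence, index):
    """Pair the original sentence (label True) with its corrupted version
    (label False), the shape an acceptability benchmark item could take."""
    return [(sentence, True), (swap_adjacent_words(sentence, index), False)]


pair = make_acceptability_pair("Jeg har en hund", 1)
```

A swap like this does not guarantee an unacceptable sentence in every case, which is why a validity check of the kind the abstract describes (manual and automatic) would still be needed before using the output as benchmark data.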
We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints.
Ringkoebing Fjord is an inland water basin on the Danish west coast separated from the North Sea by a set of gates used to control the amount of water entering and leaving the fjord. Currently, human operators decide when and how many gates to open or close to control the fjord's water level, with the goal of satisfying a range of conflicting safety and performance requirements such as keeping the water level in a target range, allowing maritime traffic, and enabling fish migration. We model the fjord as a digital twin in Uppaal Stratego. We then use this digital twin along with forecasts of the sea level and the wind speed to learn a gate controller in an online fashion. We evaluate the learned controllers under different sea-level scenarios, representing normal tidal behavior, high waters, and low waters. Our evaluation demonstrates that, unlike a baseline controller, the learned controllers satisfy the safety requirements, while performing similarly regarding the other requirements.
The matching of competences, such as skills, occupations, or knowledge, is a key desideratum for candidates to be fit for jobs. Automatic extraction of competences from CVs and job postings can greatly improve recruiters' productivity in locating relevant candidates for job vacancies. This work presents the first model that jointly extracts and classifies competences from Danish job postings. Unlike existing work on skill extraction and skill classification, our model is trained on a large volume of annotated Danish corpora and is capable of extracting a wide range of Danish competences, including skills, occupations, and knowledge of different categories. More importantly, as a single BERT-like architecture for joint extraction and classification, our model is lightweight and efficient at inference. On a real-scenario job matching dataset, our model beats the state-of-the-art models in the overall performance of Danish competence extraction and classification, and reduces inference time by over 50%.
Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: in particular, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur, the first native Danish cultural awareness dataset.
Phishing attacks remain a persistent cybersecurity threat, and the widespread adoption of TLS certificates has unintentionally enabled malicious websites to appear trustworthy to users. This study examines whether certificate metadata and domain characteristics can help distinguish phishing domains from benign domains within the Danish .dk namespace. A dataset was constructed by combining registry information from Punktum dk with phishing reports and popularity rankings from external sources. TLS certificate attributes were collected using Netlas, while additional domain-based features were derived from DNS records and lexical analysis of domain names. The analysis compares phishing, popular, and less frequently visited domains across several feature categories, including Certificate Authorities (CAs), validity periods, missing certificate fields, SAN structure, registrant geography, hosting providers, and lexical properties of domain names. The results indicate that several features show observable differences between phishing and highly popular domains. However, phishing domains often resemble less popular domains, resulting in substantial overlap across many characteristics. Con
Background: Clinical natural language processing (NLP) refers to the use of computational methods for extracting, processing, and analyzing unstructured clinical text data, and holds great potential to transform healthcare in various clinical tasks. Objective: The study aims to perform a systematic review to comprehensively assess and analyze the state-of-the-art NLP methods for mainland Scandinavian clinical text. Method: A literature search was conducted in various online databases including PubMed, ScienceDirect, Google Scholar, ACM digital library, and IEEE Xplore between December 2022 and February 2024. Further, relevant references of the included articles were also used to strengthen our search. The final pool includes articles that conducted clinical NLP in the mainland Scandinavian languages and were published in English between 2010 and 2024. Results: Out of the 113 articles, 18% (n=21) focus on Norwegian clinical text, 64% (n=72) on Swedish, 10% (n=11) on Danish, and 8% (n=9) focus on more than one language. Generally, the review identified positive developments across the region despite some observable gaps and disparities between the languages. There are substantial
We present FÆRDXEL, a tool for symbolic reasoning in the domain of Danish traffic law. FÆRDXEL combines techniques from logic programming with a novel interface that allows users to navigate through its reasoning process, thereby ensuring the system's explainability. Towards the goal of better understanding the value of FÆRDXEL, two evaluations of the system have been performed: (1) An empirical evaluation showing that for a selection of court cases, the conclusions of FÆRDXEL align with those of Danish judges. (2) A qualitative evaluation from legal experts indicating that this work has potential to become a foundation for real-world AI tools supporting professionals in the Danish legal sector.
The DANish regional atmospheric ReAnalysis (DANRA) is a novel high-resolution (2.5 km) reanalysis dataset covering Denmark and its surrounding regions over a 34-year period (1990-2023). Denmark's complex coastline, with over 400 islands and an extensive 7,400 km coastline, means that most municipalities experience mixed land-sea variability. This complexity requires a regional climate reanalysis that can resolve fine-scale coastal and inland features, as well as their impact on climate variability. DANRA is based on the HARMONIE-AROME Numerical Weather Prediction (NWP) model and assimilates a comprehensive set of observations, with a particular focus on Denmark. Compared to global reanalyses such as the ECMWF Reanalysis v5 (ERA5), DANRA demonstrates superior performance in representing essential climate variables, including near-surface weather parameters during both extreme and ordinary conditions. We illustrate these improvements in the representation of several extreme weather cases over Denmark, such as the December 1999 hurricane-force storm, the July 2022 national temperature record, and the August 2007 cloudburst in South Jutland. DANRA is made to support climate adaptation,
Technologies to monitor the provision of renewable energy are among the emerging technologies that help address the discrepancy between renewable energy production and its related usage in households. This paper presents various ways householders use a technological artifact for the real-time monitoring of renewable energy provision. Such monitoring affords householders an opportunity to adjust their energy consumption according to renewable energy provision. Ewii, previously Barry, is a Danish energy supplier that provides householders with an opportunity to monitor energy sources in real time through a technological solution of the same name. This paper uses the monitoring afforded by Ewii as a case for exploring how householders organize themselves to use a technological artifact that supports the monitoring of energy and its related usage. This study aims to inform technology design through the derivation of four personas. The derived personas highlight the differences in energy monitoring practices for the householders and their engagement. These personas are characterised as dedicated, organised, sporadic, and convenient. Understanding these differences in ener
This paper provides quasi-experimental evidence on how income taxes affect gross hourly wages, utilizing Danish administrative data and a tax reform that introduced joint taxation. Exploiting spousal income for identification, we present nonparametric, difference-in-differences graphical evidence among husbands. For low-income workers, taxes have negative and dynamic effects on wages; their wage elasticity with respect to net-of-marginal-tax rates is 0.4. For medium-income workers, the effects are smaller and insignificant. Wages respond to taxes through promotions or job-to-job transitions. Neither daily nor annual hours worked respond significantly; consequently, annual earnings respond to taxes primarily through hourly wages, rather than through labor supply.
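For readers unfamiliar with the term, the elasticity reported in the abstract above can be written out as follows. The symbols $w$ (gross hourly wage), $\tau$ (marginal tax rate), and $\varepsilon$ are notation introduced here for illustration and are not taken from the paper.

```latex
\varepsilon \;=\; \frac{\partial \ln w}{\partial \ln (1 - \tau)}
\qquad\Longrightarrow\qquad
\frac{\Delta w}{w} \;\approx\; \varepsilon \cdot \frac{\Delta (1 - \tau)}{1 - \tau}
```

Under this reading, the reported value of 0.4 for low-income workers means that a 1% increase in the net-of-marginal-tax rate is associated with roughly a 0.4% increase in the gross hourly wage.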
Named entity recognition is one of the cornerstones of Danish NLP, essential for language technology applications within both industry and research. However, Danish NER is inhibited by a lack of available datasets. As a consequence, no current models are capable of fine-grained named entity recognition, nor have they been evaluated for potential generalizability issues across datasets and domains. To alleviate these limitations, this paper introduces: 1) DANSK: a named entity dataset providing for high-granularity tagging as well as within-domain evaluation of models across a diverse set of domains; 2) DaCy 2.6.0 that includes three generalizable models with fine-grained annotation; and 3) an evaluation of current state-of-the-art models' ability to generalize across domains. The evaluation of existing and new models revealed notable performance discrepancies across domains, which should be addressed within the field. Shortcomings of the annotation quality of the dataset and its impact on model training and evaluation are also discussed. Despite these limitations, we advocate for the use of the new dataset DANSK alongside further work on the generalizability within Danish NER.
Chronic Obstructive Pulmonary Disease (COPD) is a serious and debilitating disease affecting millions around the world. Its early detection using non-invasive means could enable preventive interventions that improve quality of life and patient outcomes, with speech recently shown to be a valuable biomarker. Yet, its validity across different linguistic groups remains to be seen. To that end, audio data were collected from 96 Danish participants conducting three speech tasks (reading, coughing, sustained vowels). Half of the participants were diagnosed with different levels of COPD and the other half formed a healthy control group. Subsequently, we investigated different baseline models using openSMILE features and learnt x-vector embeddings. We obtained a best accuracy of 67% using openSMILE features and logistic regression. Our findings support the potential of speech-based analysis as a non-invasive, remote, and scalable screening tool as part of future COPD healthcare solutions.
The beverage industry, a typical food processing industry, accounts for significant energy consumption and has flexible demands. However, the deployment of energy flexibility in the beverage industry is complex and challenging. Furthermore, activation of energy flexibility from the whole brewery industry is necessary to ensure grid stability. Therefore, this paper assesses the energy flexibility potential of Denmark's brewery sector based on a multi-agent-based simulation. 239 individual brewery facilities are simulated, and each facility, as an agent, can interact with the energy system market and make decisions based on its underlying parameters and operational restrictions. The results show that the Danish breweries could save 1.56% of electricity costs annually while maintaining operational security and reducing approximately 1745 tonnes of CO2 emissions. Furthermore, medium-size breweries could obtain higher relative benefits by providing energy flexibility, especially those producing lager and ale. The results also show that the breweries' relative saving potential is electricity market-dependent.
This Data Descriptor introduces the dataset Enevaeldens Nyheder Online (News during Absolutism Online). The Enevaeldens Nyheder Online (ENO) dataset provides a reconstruction of the contents of major newspapers in Denmark and Norway during the period of Absolutism (1660-1849). The dataset contains approx. 474 million words, created using neural networks designed to process digitised microfilm versions of Danish newspapers as well as a smaller selection of Norwegian publications, all of which were hitherto illegible to computers. The contribution details this process and its results, including a way to derive standalone texts from the editions, and the accompanying BERT model trained on a beta version of the dataset.