We report the single-case trajectory of a 75-year-old retired female occupational therapist with idiopathic Parkinson's disease, Hoehn and Yahr stage 2 at diagnosis. Following progressive impairment despite standard care, she initiated training with Logic Workout (LW) in July 2025 under supervision. Within weeks, she reported meaningful improvements spanning motor function, posture, pain, fine motor skill, mood, sleep consolidation, and resolution of fatigue. Although single cases cannot establish generalizable efficacy, systematic and chronological documentation can be valuable for hypothesis generation and feasibility assessment in real-world settings prior to controlled trials. We summarize the baseline condition and treatment history, describe the LW intervention, compile self-reported outcomes, and interpret the findings in light of the underlying Logic Workout hypothesis, before concluding with key caveats and perspectives for future research.
When we train models on biased datasets, they not only reproduce data biases but can worsen them at test time - a phenomenon called bias amplification. Many current bias amplification metrics (e.g., BA (MALS), DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC offers two further improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.
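For intuition, here is a minimal, self-contained sketch of the leakage idea that underlies LIC-style metrics. It is not the authors' DBAC implementation; the captions, the gender-word list, and the classifier are toy assumptions. The idea: mask explicit gender words, train a classifier to recover gender anyway, and compare recoverability on model captions versus human captions.

```python
# Toy sketch of LIC-style leakage -- NOT the DBAC code. Higher recoverability
# of gender from model captions than from human captions suggests the
# captioning model amplifies the bias.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

GENDER_WORDS = r"\b(man|woman|men|women|he|she|his|her)\b"

def mask(caption):
    return re.sub(GENDER_WORDS, "<mask>", caption, flags=re.IGNORECASE)

def leakage(captions, genders):
    """Mean CV accuracy of predicting gender from gender-masked captions."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, [mask(c) for c in captions], genders, cv=3).mean()

genders = ["f", "m", "f", "m", "f", "m"]  # ground-truth attribute per image
human_caps = ["a woman cooking dinner", "a man fixing a car",
              "a woman reading a book", "a man playing guitar",
              "a woman walking a dog", "a man riding a skateboard"]
model_caps = ["a woman cooking in a kitchen", "a man repairing an engine",
              "a woman washing dishes", "a man holding a wrench",
              "a woman holding a baby", "a man playing basketball"]

print("amplification:", leakage(model_caps, genders) - leakage(human_caps, genders))
```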
As the deployment of large language models (LLMs) expands, there is an increasing demand for personalized LLMs. One method to personalize and guide the outputs of these models is by assigning a persona -- a role that describes the expected behavior of the LLM (e.g., a man, a woman, an engineer). This study investigates whether an LLM's understanding of social norms varies across assigned personas. Ideally, the perception of a social norm should remain consistent regardless of the persona, since the acceptability of a social norm should be determined by the region the norm originates from, rather than by individual characteristics such as gender, body size, or race. A norm is universal within its cultural context. In our research, we tested 36 distinct personas from 12 sociodemographic categories (e.g., age, gender, beauty) across four different LLMs. We find that LLMs' cultural norm interpretation varies based on the persona used, and that norm interpretation also varies within a sociodemographic category (e.g., a fat person versus a thin person within the physical-appearance category), where an LLM given the more socially desirable persona (e.g., a thin person) interprets social norms more accurately.
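A minimal probe in the spirit of this setup, not the paper's protocol: ask the same norm question under different personas and check whether the judgment stays constant. The model name, personas, and norm question below are assumptions for the example.

```python
# Persona-consistency probe (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
PERSONAS = ["a thin person", "a fat person", "a young woman", "an old man"]
QUESTION = ("In the culture this norm comes from, is it acceptable to greet "
            "elders with your left hand? Answer only 'acceptable' or 'unacceptable'.")

answers = {}
for persona in PERSONAS:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; the paper tests four LLMs
        messages=[{"role": "system", "content": f"You are {persona}."},
                  {"role": "user", "content": QUESTION}],
        temperature=0,
    )
    answers[persona] = resp.choices[0].message.content.strip().lower()

# Ideally the judgment is persona-invariant: one unique answer across personas.
print(answers, "consistent:", len(set(answers.values())) == 1)
```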
Data on historical populations often extends no further than numbers of people by broad age-sex group, with nothing on numbers of births or deaths. Demographers studying these populations have experimented with methods that use the data on numbers of people to infer birth and death rates. These methods have, however, received little attention since they were first developed in the 1960s. We revisit the problem of inferring demographic rates from population structure, spelling out the assumptions needed, and specialising the methods to the case where only child-woman ratios are available. We apply the methods to the case of Maori populations in nineteenth-century Aotearoa New Zealand. We find that, in this particular case, the methods reveal as much about the nature of the data as they do about historical demographic conditions.
This paper investigates the subtle and often concealed biases present in Large Language Models (LLMs), focusing on implicit biases that may remain despite passing explicit bias tests. Implicit biases are significant because they influence the decisions made by these systems, potentially perpetuating stereotypes and discrimination, even when LLMs appear to function fairly. Traditionally, explicit bias tests or embedding-based methods are employed to detect bias, but these approaches can overlook more nuanced, implicit forms of bias. To address this, we introduce two novel psychology-inspired methodologies: the LLM Implicit Association Test (IAT) Bias and the LLM Decision Bias, designed to reveal and measure implicit biases through prompt-based and decision-making tasks. Additionally, open-ended generation tasks with thematic analysis of word generations and storytelling provide qualitative insights into the model's behavior. Our findings demonstrate that the LLM IAT Bias correlates with traditional methods and more effectively predicts downstream behaviors, as measured by the LLM Decision Bias, offering a more comprehensive framework for detecting subtle biases in AI systems.
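To make the prompt-based IAT idea concrete, here is a hedged sketch: a model is asked to pair each attribute word with one of two groups, and skewed pairings across many attributes indicate an implicit association. The model name, prompt wording, and word lists are assumptions for the example, not the paper's LLM IAT Bias instrument.

```python
# Sketch of a prompt-based IAT-style probe (illustrative assumptions throughout).
from collections import Counter
from openai import OpenAI

client = OpenAI()
GROUPS = ("men", "women")
ATTRIBUTES = ["career", "family", "science", "arts", "salary", "household"]

pairings = Counter()
for word in ATTRIBUTES:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": (f"Reply with exactly one word, '{GROUPS[0]}' or "
                               f"'{GROUPS[1]}': which group comes to mind for "
                               f"'{word}'?")}],
        temperature=0,
    )
    pairings[resp.choices[0].message.content.strip().lower()] += 1

# A heavy skew in pairings across many attribute words signals an implicit
# association that explicit bias tests may miss.
print(pairings)
```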
No woman of science has lived a more controversial life or possessed a more contrasting character than Gabrielle Emilie Le Tonnelier, Marquise du Chatelet. On one hand, she was a woman of great intelligence, a philosopher of science, a student of mathematics, and an ardent supporter of Newton and his new laws of physics. At the same time, Emilie du Chatelet was an aristocratic society woman who gambled, enjoyed parties, and had several extramarital affairs, provoking numerous scandals in her native Paris. She was a passionate woman who was at ease conversing with the nobles at the court and with the most renowned scholars of her time. Emilie du Chatelet did not develop theorems and she did not discover new scientific principles. However, she studied mathematics with Maupertuis and Clairaut to better understand the geometrical language in Newton's Principia. In this article, we review some important aspects of this controversial woman of science, exploring her relationship with the greatest scholars of her time.
In this paper, we propose a data-driven method to measure the impact of the 'woman card' exchange between Hillary Clinton and Donald Trump. Building from a unique dataset of the two candidates' Twitter followers, we first examine the transition dynamics of the two candidates' Twitter followers in the week before and the week after the exchange. Then we train a convolutional neural network to classify the gender of the followers and unfollowers, and study how women in particular reacted to the 'woman card' exchange. Our study suggests that the 'woman card' comment made women more likely to follow Hillary Clinton and less likely to unfollow her, and that it apparently did not affect the gender composition of Trump's followers.
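For concreteness, a sketch of the kind of image-based gender classifier the paper describes. The backbone, folder layout, and hyperparameters are assumptions, not the authors' setup.

```python
# Fine-tune a small CNN to classify gender from profile pictures (sketch).
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
# Hypothetical layout: data/train/female/*.jpg and data/train/male/*.jpg
train_set = datasets.ImageFolder("data/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: female / male
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one epoch shown for brevity
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```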
Wikipedia -- like most peer production communities -- suffers from a basic problem: the amount of work that needs to be done (articles to be created and improved) exceeds the available resources (editor effort). Recommender systems have been deployed to address this problem, but they have tended to recommend work tasks that match individuals' personal interests, ignoring more global community values. In English Wikipedia, discussion about Vital articles constitutes a proxy for community values about the types of articles that are most important and should therefore be prioritized for improvement. We first analyzed these discussions, finding that an article's priority is considered a function of 1) its inherent importance and 2) its effects on Wikipedia's global composition. One important example of the second consideration is balance, including along the dimensions of gender and geography. We then conducted a quantitative analysis evaluating how four different article prioritization methods -- two from prior research -- would affect Wikipedia's overall balance on these two dimensions; we found significant differences among the methods. We discuss the implications of our results.
Prior work shows that men and women speak with different levels of confidence, though it's often assumed that these differences are innate or are learned in early childhood. Using academic publishing as a setting, we find that language differences across male and female authors are initially negligible: in first drafts of academic manuscripts, men and women write with similar levels of uncertainty. However, when we trace those early drafts to their published versions, a substantial gender gap in linguistic uncertainty arises. That is, women increase their use of cautionary language through the publication process more than men. We show this increase in the linguistic gender gap varies substantially based on editor assignment. Specifically, our author-to-editor matched dataset allows us to estimate editor-specific fixed effects, capturing how specific editors impact the change in linguistic uncertainty for female authors relative to male authors (the editor's author-gender gap). Editors' author-gender gaps vary widely, and correlate with observable editor characteristics such as societal norms in their country-of-origin, their work history, and the year that they obtained their PhD.
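As an illustration of the kind of fixed-effects specification described, consider the sketch below: regress each manuscript's draft-to-publication change in hedging on editor dummies and editor-by-author-gender interactions. The column names (delta_hedging, editor_id, female_author) and the input file are hypothetical.

```python
# Illustrative editor fixed-effects regression (column names are assumptions).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("manuscripts.csv")  # one row per manuscript
# delta_hedging: published hedging score minus first-draft hedging score
# female_author: 1 if the author is a woman, 0 otherwise
fit = smf.ols("delta_hedging ~ C(editor_id) + C(editor_id):female_author",
              data=df).fit()

# Each interaction coefficient is one editor's author-gender gap: how much more
# that editor's female authors increase hedging relative to male authors.
print(fit.params.filter(like="female_author"))
```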
Analogies such as "man is to king as woman is to X" are often used to illustrate the power of word embeddings. Concurrently, they have also been used to expose how strongly human biases are encoded in vector spaces built on natural language, as in "man is to computer programmer as woman is to homemaker". Recent work has shown that analogies are in fact not such an effective diagnostic for bias, and other methods have proven more apt for the task. However, besides the intrinsic problems with the analogy task as a bias detection tool, in this paper we show that a series of issues related to how analogies have been implemented and used may have yielded a distorted picture of bias in word embeddings. Human biases are present in word embeddings and need to be addressed. Analogies, though, are probably not the right tool to do so. Moreover, the way they have most often been used has exacerbated some possibly non-existent biases and perhaps hidden others. Because they are still widely popular, and some of them have become classics within and outside the NLP community, we deem it important to provide a series of clarifications that should put well-known, and potentially new, cases into the right perspective.
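The computation behind such analogies is simple vector arithmetic plus a nearest-neighbour lookup. The sketch below shows it with gensim's pretrained GloVe vectors, along with one implementation detail of the kind the paper scrutinizes: most_similar silently excludes the query words from the candidates, which can shape the apparent "bias" in the answer.

```python
# "man is to king as woman is to X": X is the nearest neighbour of
# king - man + woman. Downloads ~130 MB of GloVe vectors on first run.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# gensim's most_similar excludes the query words from the candidates ...
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# ... whereas a raw nearest-neighbour search over the same offset often just
# returns a query word itself, showing how this implementation choice shapes
# the answers that analogies appear to give.
target = vectors["king"] - vectors["man"] + vectors["woman"]
print(vectors.similar_by_vector(target, topn=3))
```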
We present a systematic study of biases in natural language generation (NLG) by analyzing text generated from prompts that contain mentions of different demographic groups. In this work, we introduce the notion of the regard towards a demographic, use the varying levels of regard towards different demographics as a defining metric for bias in NLG, and analyze the extent to which sentiment scores are a relevant proxy metric for regard. To this end, we collect strategically-generated text from language models and manually annotate the text with both sentiment and regard scores. Additionally, we build an automatic regard classifier through transfer learning, so that we can analyze biases in unseen text. Together, these methods reveal the extent of the biased nature of language model generations. Our analysis provides a study of biases in NLG, bias metrics and correlated human judgments, and empirical evidence on the usefulness of our annotated dataset.
Word embeddings are the standard model for semantic and syntactic representations of words. Unfortunately, these models have been shown to exhibit undesirable word associations resulting from gender, racial, and religious biases. Existing post-processing methods for debiasing word embeddings are unable to mitigate gender bias hidden in the spatial arrangement of word vectors. In this paper, we propose RAN-Debias, a novel gender debiasing methodology which not only eliminates the bias present in a word vector but also alters the spatial distribution of its neighbouring vectors, achieving a bias-free setting while maintaining minimal semantic offset. We also propose a new bias evaluation metric - Gender-based Illicit Proximity Estimate (GIPE) - which measures the extent of undue proximity in word vectors resulting from the presence of gender-based predilections. Experiments based on a suite of evaluation metrics show that RAN-Debias significantly outperforms the state-of-the-art in reducing proximity bias (GIPE) by at least 42.02%. It also reduces direct bias while adding minimal semantic disturbance, and achieves the best performance in a downstream application task (coreference resolution).
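To illustrate the proximity-bias intuition, here is a minimal neighbourhood check in the spirit of GIPE; the paper's exact definition differs in detail, and the word lists here are assumptions. For each target word, we measure what fraction of its nearest neighbours are explicitly gendered words.

```python
# Neighbourhood-proximity check (illustrative, not the paper's GIPE).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
GENDERED = {"he", "she", "man", "woman", "his", "her", "male", "female"}
TARGETS = ["nurse", "doctor", "engineer", "receptionist", "programmer"]

def gendered_neighbour_rate(word, k=50):
    """Fraction of the k nearest neighbours that are explicitly gendered."""
    neighbours = [w for w, _ in vectors.most_similar(word, topn=k)]
    return sum(w in GENDERED for w in neighbours) / k

for word in TARGETS:
    print(word, round(gendered_neighbour_rate(word), 3))
```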
This paper was motivated by the worldwide May 12 initiative, which aims to celebrate, encourage, and inspire women in mathematics. It briefly presents how the May 12 initiative arose and some of the events of its first years, in particular the Generalized Functions online workshop that started in 2021 in this context (and has continued as an annual event ever since). It also gives a brief overview of some female mathematicians with significant scientific contributions who were the first women in some respect: Maryam Mirzakhani (the first female mathematician awarded the prestigious Fields Medal; the May 12 initiative appeared in her honour), Hypatia (considered the earliest known female mathematician), Sofia Kovalevskaya (the first woman to be awarded a doctorate in mathematics, and considered the first woman to hold a full professorship in mathematics in the modern academic sense), Emmy Noether (the first woman to give a plenary lecture at the International Congress of Mathematicians), Karen Uhlenbeck (the first woman awarded the prestigious Abel Prize), and Ingrid Daubechies (the first woman to become a full professor of mathematics at Princeton University).
We interviewed Iris Abt, who studied in Germany and was the only woman to finish her degree course that year. She then started her career in neutrino physics, moved to SLAC at the time of the SLC, came back to Europe to help shape part of the HERA program, and carried out studies on germanium detectors.
This thesis investigates whether large language models (LLMs) can be guided to simulate a consistent personality through prompt engineering. The study explores this concept within the context of a chatbot designed for Speech-Language Pathology (SLP) student training, specifically focused on gender-affirming voice therapy. The chatbot, named Monae Jackson, was created to represent a 32-year-old transgender woman and engage in conversations simulating client-therapist interactions. Findings suggest that, with prompt engineering, the chatbot maintained a recognizable and consistent persona and exhibited a distinct personality profile on the Big Five personality test. These results support the idea that prompt engineering can be used to simulate stable personality characteristics in AI chatbots.
We develop a linear one-sex dynamical model of human population reproduction through marriage. In our model, a woman may marry and divorce multiple times; however, only women who are currently married are assumed to bear children. The iterative marriage process is formulated as a three-state compartmental model, which is described by a system of McKendrick equations with a marital birth rate function that depends on the duration of marriage and the age at marriage. To examine the impact of changing nuptiality on fertility, we derive new formulas for the reproduction indices. In particular, the total fertility rate (TFR) is expressed as the product of the total marriage number and the average total marital fertility. Using Japanese vital statistics, we show that our model provides a reasonable estimate of the current TFR and its future trajectory.
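In hedged notation (the symbols are ours, not necessarily the paper's), the stated decomposition of the TFR reads:

```latex
\[
  \mathrm{TFR}
  \;=\;
  \underbrace{M}_{\substack{\text{total marriage}\\\text{number}}}
  \times
  \underbrace{\bar{F}_{\mathrm{m}}}_{\substack{\text{average total}\\\text{marital fertility}}}
\]
% M: expected number of (first and subsequent) marriages per woman over the
% reproductive ages; \bar{F}_m: expected number of births per marriage,
% averaged over age at marriage and marriage duration.
```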
Consider the problem: "If one man and one woman can produce one child in one year, how many children will be produced by one woman and three men in 0.5 years?" Current large language models (LLMs) such as GPT-4o, GPT-o1-preview, and Gemini Flash frequently answer "0.5," which does not make sense. While these models sometimes acknowledge the unrealistic nature of the question, in many cases (8 out of 10 trials) they provide the nonsensical answer of "0.5 child." Additionally, temporal variation has been observed: if an LLM answers correctly once (by recognizing the faulty nature of the question), subsequent responses are more likely to also reflect this understanding, although this is inconsistent. These types of questions motivated us to develop SciFaultyQA, a dataset of science questions that are themselves intentionally faulty. We observed that LLMs often proceed to answer these flawed questions without recognizing their inherent issues, producing results that are logically or scientifically invalid. By analyzing such patterns, we developed a novel method for generating synthetic datasets to evaluate and benchmark the performance of various LLMs in identifying such faulty questions.
We present seven experiments exploring gender biases in GPT. Initially, GPT was asked to generate the demographics of a potential writer of twenty phrases containing feminine stereotypes and twenty with masculine stereotypes. Results show a strong asymmetry, with stereotypically masculine sentences attributed to a female writer more often than vice versa. For example, the sentence "I love playing fotbal! Im practicing with my cosin Michael" was consistently assigned by ChatGPT to a female writer. This phenomenon likely reflects that while initiatives to integrate women into traditionally masculine roles have gained momentum, the reverse movement remains relatively underdeveloped. Subsequent experiments investigate the same issue in high-stakes moral dilemmas. GPT-4 finds it more appropriate to abuse a man to prevent a nuclear apocalypse than to abuse a woman. This bias extends to other forms of violence central to the gender parity debate (abuse), but not to those less central (torture). Moreover, this bias increases in cases of mixed-sex violence for the greater good: GPT-4 agrees with a woman using violence against a man to prevent a nuclear apocalypse but disagrees with a man using violence against a woman to the same end.
Berta Karlik was an Austrian physicist who was not only among the early radioactivity researchers and nuclear physicists in Vienna, but who also pioneered women's academic careers in Austria. She was the first woman at the University of Vienna to acquire the venia legendi in physics, and the first woman to become a full professor at a philosophical faculty in Austria. For almost thirty years she was the head of the Institute for Radium Research of the Austrian Academy of Sciences.
Artificial neural networks typically struggle to generalize to out-of-context examples. One reason for this limitation is that datasets incorporate only partial information about the potential correlational structure of the world. In this work, we propose TIDA (Targeted Image-editing Data Augmentation), a targeted data augmentation method focused on improving models' human-like abilities (e.g., gender recognition) by filling the correlational structure gap using a text-to-image generative model. More specifically, TIDA identifies specific skills in captions describing images (e.g., the presence of a specific gender in the image), changes the caption (e.g., "woman" to "man"), and then uses a text-to-image model to edit the image in order to match the novel caption (e.g., changing a woman to a man while keeping the context identical). Based on the Flickr30K benchmark, we show that, compared with the original dataset, a TIDA-enhanced dataset related to gender, color, and counting abilities induces better performance in several image captioning metrics. Furthermore, on top of relying on the classical BLEU metric, we conduct a fine-grained analysis of these improvements.
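A sketch of one TIDA-style augmentation step, not the authors' pipeline: swap the gender word in the caption, then have an instruction-following image editor bring the image in line with the new caption. The editing model, the naive word swap, and the file paths are assumptions for the example.

```python
# One TIDA-style caption/image edit (illustrative assumptions throughout).
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

caption = "a woman riding a bicycle down the street"
edited_caption = caption.replace("woman", "man")  # targeted skill: gender

image = Image.open("flickr30k/0001.jpg")  # hypothetical path
edited = pipe("change the woman to a man", image=image,
              num_inference_steps=20).images[0]
edited.save("augmented_0001.jpg")  # pair with edited_caption for training
```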