Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises various concerns, including whether they would actually save clinicians time and effort in their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose a novel evaluation framework for assessing clinician editing load of LLM-drafted responses at both content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using various adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial epistemic uncertainty in aligning LLM drafts with clinician responses. While LLMs demonstrate capability in drafting certain thematic elements, they struggle with clinician-aligned generation in other themes, particularly question asking to elicit further information from patients. Theme-driven adapt
Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark
Research on gender and language is closely tied to social debates on gender equality and non-discriminatory language use. Psycholinguistic scholars have made significant contributions in this field. However, corpus-based studies that investigate these matters within the context of language use are still rare. In our study, we address the question of how much textual material would actually have to be changed if non-gender-inclusive texts were rewritten to be gender-inclusive. This quantitative measure is an important empirical insight, as a recurring argument against the use of gender-inclusive German is that it supposedly makes written texts too long and complicated. It is also argued that gender-inclusive language has negative effects on language learners. However, such effects are only likely if gender-inclusive texts are very different from those that are not gender-inclusive. In our corpus-linguistic study, we manually annotated German press texts to identify the parts that would have to be changed. Our results show that, on average, less than 1% of all tokens would be affected by gender-inclusive language. This small proportion calls into question whether gender-inclusive
How would the sound in a studio change with a carpeted floor and acoustic tiles on the walls? We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. We address this task with a novel encoder-decoder approach that encodes the scene's key properties from an audio-visual observation and generates the target Room Impulse Response (RIR) conditioned on the material specifications provided by the user. Our model enables the generation of diverse RIRs based on various material configurations defined dynamically at inference time. To support this task, we create a new benchmark, the Acoustic Wonderland Dataset, designed for developing and evaluating material-aware RIR prediction methods under diverse and challenging settings. Our results demonstrate that the proposed model effectively encodes material information and generates high-fidelity RIRs, outperforming several baselines and state-of-the-art methods.
Investigators are often interested in how a treatment affects an outcome for units responding to treatment in a certain way. We may wish to know the effect among units that, for example, meaningfully implemented an intervention, passed an attention check, or demonstrated some important mechanistic response. Simply conditioning on the observed value of the post-treatment variable introduces problematic biases. Further, the identification assumptions required of several existing strategies are often indefensible. We propose the Treatment Reactive Average Causal Effect (TRACE), which we define as the total effect of treatment in the group that, if treated, would realize a particular value of the relevant post-treatment variable. By reasoning about the effect among the "non-reactive" group, we can identify and estimate the range of plausible values for the TRACE. We demonstrate the use of this approach with three examples: (i) learning the effect of police-perceived race on police violence during traffic stops, a case where point identification may be possible; (ii) estimating effects of a community-policing intervention in Liberia, in communities that meaningfully implemented it, and
The notion of liquid water beneath the ice layer at the south polar layered deposits of Mars is an interesting possibility given the implications for astrobiology and possible human habitation. A body of liquid water located at a depth of 1.5 km has been inferred from radar data in the South Polar Cap. However, the high temperatures that would facilitate the existence of liquid water or brine at that depth are not consistent with estimations of heat flow that are based on the lithosphere's flexure. Attempts to reconcile both issues have been inconclusive or otherwise unsuccessful. Here, we analyse the possible role of subsurface ammonia and methanol in maintaining water in a liquid state at subsurface temperatures that are compatible with the strength of the lithosphere. Our results indicate that the presence of these compounds at the base of the south polar layered deposits can reconcile the existence of liquid water with previous estimations of surface heat flow.
We quantify resilience with metrics extracted from the historical outage data that is routinely recorded by many distribution utilities. The outage data is coordinated with wind data to relate average outage rates in an area to wind speed measured at a nearby weather station. A past investment in wind hardening would have reduced the outage rates, and the effect of this on metrics can be calculated by sampling a reduced number of the historical outages and recomputing the metrics. This quantifies the impact that the hardening would have had on customers. It gives a tangible way to relate an investment in wind resilience to the benefits it would have had for the lived experience of customers, which could help make the case for the investment to the public and regulators. We also quantify the impact of earlier or faster restoration on customer metrics and compare this to the impact of investment in hardening. Overall, this is a new and straightforward approach, driven directly by utility data, to quantifying resilience and justifying resilience investments to stakeholders. Because it is driven by data, the approach avoids complicated models and modeling assumptions.
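The resampling idea described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the record fields (`wind_mps`, `customers`, `duration_h`), the customer-hours metric, the wind threshold, and the choice to thin outages independently with a fixed probability. The abstract only states that a reduced number of historical outages is sampled and the metrics are recomputed.

```python
import random

def customer_hours(outages):
    """Total customer hours lost: customers affected times duration, summed."""
    return sum(o["customers"] * o["duration_h"] for o in outages)

def thin_outages(outages, reduction, wind_threshold, rng):
    """Drop each high-wind outage with probability `reduction` (assumed model).

    Outages below `wind_threshold` are always kept, since hardening against
    wind is assumed not to affect them.
    """
    kept = []
    for o in outages:
        if o["wind_mps"] >= wind_threshold and rng.random() < reduction:
            continue  # assumed to be prevented by the hypothetical hardening
        kept.append(o)
    return kept

def mean_metric_after_hardening(outages, reduction, wind_threshold,
                                n_samples=1000, seed=0):
    """Average the recomputed metric over many resampled outage histories."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += customer_hours(
            thin_outages(outages, reduction, wind_threshold, rng))
    return total / n_samples

# Hypothetical historical records, for illustration only.
history = [
    {"wind_mps": 5.0, "customers": 100, "duration_h": 2.0},
    {"wind_mps": 20.0, "customers": 500, "duration_h": 4.0},
    {"wind_mps": 25.0, "customers": 300, "duration_h": 1.0},
]
baseline = customer_hours(history)
hardened = mean_metric_after_hardening(history, reduction=0.3,
                                       wind_threshold=15.0)
```

Comparing `baseline` with `hardened` gives the kind of before-and-after customer metric the abstract refers to. A real study would use the utility's own metrics and an outage-rate reduction tied to measured wind speed.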
In the context of our research activities on affective computing and human-robot interaction, we are working on both the recognition of human emotions and the expression of emotions by robots. In our vision, robots will be increasingly present in schools, factories, and homes, and their empathetic behavior may foster their acceptance. In particular, in one of our studies, we sought to replicate gestures associated with specific emotions on a social robot, NAO. Our focus was on Ekman's six primary emotions, along with five emotions selected from Plutchik's wheel of emotions. In our opinion, the cultural component linked to the expression of emotions through gestures certainly influenced both us and the participants. Thus, we would like to investigate the influence of our culture on the gestural expression of emotion.
The onset of planet formation is actively under debate. Recent mass measurements of disks around protostars suggest an early start of planet formation in the Class 0/I disks. However, dust substructures, one possible signature of forming planets, are rarely observed in the young Class 0/I disks, while they are ubiquitous in the mature Class II disks. It is not clear whether the lack of dust substructures in the Class 0/I disks indicates absence of planets or whether it is due to other physical effects such as temperature and dust opacity. Here we consider the effect of temperature on the ability of planets to produce dust substructures. We prescribe the evolution of the disk and the protostar from Class 0 to Class II phase and calculate the disk temperature using radiative transfer models at various stages of the evolution. We use the mid-plane temperature to calculate the disk scale height and the minimum planet mass needed to open observable dust gaps using the thermal criterion. We find that this minimum planet mass decreases as a function of time. Particularly, we find that if a planet up to ${\sim}5$ M$_{\oplus}$ in the inner ${\sim}5$ au or up to ${\sim}10-50$ M$_{\oplus}$ at
Among solar system objects, comets coming from the Oort Cloud are an elusive population, intrinsically rare and difficult to detect. Nonetheless, as some of the most pristine objects we can observe, they encapsulate critical clues to the formation of planetary systems and are the focus of many scientific investigations and science missions. The Legacy Survey of Space and Time (LSST), which will start to operate from the Vera C. Rubin Observatory in 2025, is expected to dramatically improve our ability to detect these comets by performing regular monitoring of the Southern sky deep down to magnitude 24.5 with excellent astrometry. However, making straightforward predictions on future LSST detection rates is challenging due to our biased knowledge of the underlying population. This is because identifications to date have been conducted by various surveys or individual observers, often without detailed information on their respective selection functions. Recent efforts to predict the incoming flux of Long Period Comets still suffer from the lack of systematic, well-characterized, homogeneous cometary surveys. Here, we adopt a different point of view by asking how much earlier on known comets on l
We consider the problem of identifying a minimal subset of training data $\mathcal{S}_t$ such that if the instances comprising $\mathcal{S}_t$ had been removed prior to training, the categorization of a given test point $x_t$ would have been different. Identifying such a set may be of interest for a few reasons. First, the cardinality of $\mathcal{S}_t$ provides a measure of robustness (if $|\mathcal{S}_t|$ is small for $x_t$, we might be less confident in the corresponding prediction), which we show is correlated with but complementary to predicted probabilities. Second, interrogation of $\mathcal{S}_t$ may provide a novel mechanism for contesting a particular model prediction: If one can make the case that the points in $\mathcal{S}_t$ are wrongly labeled or irrelevant, this may argue for overturning the associated prediction. Identifying $\mathcal{S}_t$ via brute-force is intractable. We propose comparatively fast approximation methods to find $\mathcal{S}_t$ based on influence functions, and find that -- for simple convex text classification models -- these approaches can often successfully identify relatively small sets of training examples which, if removed, would flip the pr
We investigate the impact of deep generative models on potential social biases in upcoming computer vision models. As the internet witnesses an increasing influx of AI-generated images, concerns arise regarding inherent biases that may accompany them, potentially leading to the dissemination of harmful content. This paper explores whether a detrimental feedback loop, resulting in bias amplification, would occur if generated images were used as the training data for future models. We conduct simulations by progressively substituting original images in COCO and CC3M datasets with images generated through Stable Diffusion. The modified datasets are used to train OpenCLIP and image captioning models, which we evaluate in terms of quality and bias. Contrary to expectations, our findings indicate that introducing generated images during training does not uniformly amplify bias. Instead, instances of bias mitigation across specific tasks are observed. We further explore the factors that may influence these phenomena, such as artifacts in image generation (e.g., blurry faces) or pre-existing biases in the original datasets.
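The progressive substitution described in this abstract can be sketched as follows. This is a hedged illustration, not the authors' pipeline. It assumes each original image has exactly one generated counterpart, that images are chosen for replacement uniformly at random, and that the path strings and function names are hypothetical placeholders.

```python
import random

def substitute_fraction(image_ids, fraction, seed=0):
    """Return (image_id, source) pairs with `fraction` of images swapped.

    `source` is "generated" for images replaced by their assumed generated
    counterpart and "original" otherwise. Selection is uniform at random,
    which is an assumption made for this sketch.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    ids = list(image_ids)
    n_replace = round(fraction * len(ids))
    replaced = set(rng.sample(ids, n_replace))
    return [(i, "generated" if i in replaced else "original") for i in ids]

# Build a series of datasets with an increasing share of generated images.
ids = [f"img_{k:03d}" for k in range(10)]
schedule = {f: substitute_fraction(ids, f) for f in (0.0, 0.5, 1.0)}
```

Each entry of `schedule` would then be used to train a separate model, so that quality and bias can be compared across substitution levels.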
Processing fragments of data collected on a monitored person to find out whether this person is a would-be terrorist (WT) is very challenging. Moreover, the process has proven to be deceptive, with repeated dramatic failures. To address the issue I suggest a simple mirror model to mimic the process at stake. The model considers a collection of ground items which are labelled either Terrorist Connected (TC) or Terrorist Free (TF). To extract the signal from the ground data items I implement an iterated coarse-grained scheme, which yields a giant unique item with a label TC or TF. The results obtained validate the processing scheme with correct outcomes for the full range of proportions of TC items, except in a specific sub-range. There, a systematic wrong labelling of the giant item is obtained to the benefit of WTs, who are wrongly labelled not would-be terrorist (NWT). This flaw proves to be irremovable because it is anchored within the processing itself in connection with the treatment of uncertain aggregates, which inevitably appear. The ``natural'' allocation of uncertain aggregates to the TF label, in tune with the ethical application of the presumption of innocence in force in de
Some people did not get the COVID-19 vaccine even though it was offered at no cost. A monetary incentive might lead people to vaccinate, although existing studies have provided different findings about this effect. We investigate how monetary incentives differ according to individual characteristics. Using panel data with online experiments, we found that (1) subsidies reduced vaccine intention but increased it after controlling for heterogeneity; (2) the stronger the social image against the vaccination, the lower the monetary incentive; and (3) persistently unvaccinated people would intend to be vaccinated only if a large subsidy was provided.
To make reinforcement learning more sample efficient, we need better credit assignment methods that measure an action's influence on future rewards. Building upon Hindsight Credit Assignment (HCA), we introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms. Our algorithms achieve precise credit assignment by measuring the contribution of actions upon obtaining subsequent rewards, by quantifying a counterfactual query: 'Would the agent still have reached this reward if it had taken another action?'. We show that measuring contributions w.r.t. rewarding states, as is done in HCA, results in spurious estimates of contributions, causing HCA to degrade towards the high-variance REINFORCE estimator in many relevant environments. Instead, we measure contributions w.r.t. rewards or learned representations of the rewarding objects, resulting in gradient estimates with lower variance. We run experiments on a suite of problems specifically designed to evaluate long-term credit assignment capabilities. By using dynamic programming, we measure ground-truth policy gradients and show that the improved performance of our new model-based cred
It continues to be alleged that superluminal influences of any sort would be inconsistent with special relativity for the following three reasons: (i) they would imply the existence of a `distinguished' frame; (ii) they would allow the detection of absolute motion; and (iii) they would violate the relativity of simultaneity. This paper shows that the first two objections rest upon very elementary misunderstandings of Minkowski geometry and lingering Newtonian intuitions about instantaneity. The third objection has a basis, but rather than invalidating the notion of faster-than-light influences it points the way to more general conceptions of simultaneity that could allow for quantum nonlocality in a natural way.
Recently Pen and Spergel (1997) have shown that a universe whose energy density is dominated by a frustrated network of non-Abelian TeV-scale cosmic strings could account for a broad class of cosmological observations. In this paper we consider the effects of such a string network on the massive black holes widely believed to inhabit the centers of many galaxies. As these black holes traverse the universe together with their host galaxies, they would intersect a large number of string segments. We argue that such segments would become stuck to the black hole, and be stretched by the hole's motion. Stretching the strings would cause significant deceleration of the black holes. Although the black holes would probably not be removed from the galaxies completely, they would be noticeably displaced from the galactic center of mass -- by at least 5 kpc. This displacement seems to be in contradiction with the observational evidence.
Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training or data sharing, the choice and interpretation of the key parameter, the privacy budget $\varepsilon$ that governs the strength of privacy protection, remain largely arbitrary. We argue that determining the $\varepsilon$ value should not be solely in the hands of researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for an $\varepsilon$ of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study the behavior of people in uncertain decision-making situations with respect to privacy-threatening situations. Framing the risk perception in terms of two realistic NLP scenarios and using a vignette behavioral study help us determine what $\varepsilon$ thresholds would lead lay people to be willing to share sensitive textual data; to our knowledge, this is the first study of its kind.
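To give a sense of scale for an $\varepsilon$ of 10, here is a minimal sketch of the standard numerical reading of pure $\varepsilon$-differential privacy (that is, with $\delta = 0$). The guarantee bounds the ratio of output probabilities on neighboring datasets by $e^{\varepsilon}$, so an adversary's posterior odds can exceed the prior odds by at most that factor. The function names are illustrative, and this reading is an assumption of the sketch: the abstract does not say how the study presents $\varepsilon$ to participants.

```python
import math

def max_probability_ratio(epsilon):
    """Largest allowed ratio of output probabilities on neighboring datasets."""
    return math.exp(epsilon)

def worst_case_posterior(prior, epsilon):
    """Upper bound on an adversary's posterior belief under pure epsilon-DP.

    Posterior odds are at most exp(epsilon) times the prior odds.
    `prior` must lie strictly between 0 and 1.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    odds = prior / (1.0 - prior) * math.exp(epsilon)
    return odds / (1.0 + odds)

# Compare a small and a large privacy budget for an adversary who starts
# out undecided (prior belief of 0.5).
for eps in (0.1, 1.0, 10.0):
    print(eps, max_probability_ratio(eps), worst_case_posterior(0.5, eps))
```

With $\varepsilon = 10$ the allowed ratio exceeds 22,000, so in this worst-case reading an undecided adversary could become almost certain. This is one way to see why the choice of threshold matters to the people sharing their data.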
Online platforms, including social media and search platforms, have routinely used their users' data for targeted ads, to improve their services, and to sell to third-party buyers. But an increasing awareness of the importance of users' data privacy has led to new laws that regulate data-sharing by platforms. Further, there have been political discussions on introducing data dividends, that is, paying users for their data. Three interesting questions are then: When would these online platforms be incentivized to pay data dividends? How does their decision depend on whether users value their privacy more than the platform's free services? And should platforms invest in protecting users' data? This paper considers various factors affecting the users' and platform's decisions through utility functions. We construct a principal-agent model using a Stackelberg game to calculate their optimal decisions and qualitatively discuss the implications. Our results could inform a policymaker trying to understand the consequences of mandating data dividends.
Many researchers and policymakers have expressed excitement about algorithmic explanations enabling more fair and responsible decision-making. However, recent experimental studies have found that explanations do not always improve human use of algorithmic advice. In this study, we shed light on how people interpret and respond to counterfactual explanations (CFEs) -- explanations that show how a model's output would change with marginal changes to its input(s) -- in the context of pretrial risk assessment instruments (PRAIs). We ran think-aloud trials with eight sitting U.S. state court judges, providing them with recommendations from a PRAI that includes CFEs. We found that the CFEs did not alter the judges' decisions. At first, judges misinterpreted the counterfactuals as real -- rather than hypothetical -- changes to defendants. Once judges understood what the counterfactuals meant, they ignored them, stating their role is only to make decisions regarding the actual defendant in question. The judges also expressed a mix of reasons for ignoring or following the advice of the PRAI without CFEs. These results add to the literature detailing the unexpected ways in which people respo