Understanding who blames or supports whom in news text is a critical research question in computational social science. Traditional methods and datasets for sentiment analysis are, however, not suitable for the domain of political text, as they do not consider the direction of sentiments expressed between entities. In this paper, we propose a novel NLP task of identifying directed sentiment relationships between political entities in a given news document, which we call directed sentiment extraction. From a million-scale news corpus, we construct a dataset of news sentences where sentiment relations of political entities are manually annotated. We present a simple but effective approach for utilizing a pretrained transformer, which infers the target class by predicting multiple question-answering tasks and combining the outcomes. We demonstrate the utility of our proposed method for social science research questions by analyzing positive and negative opinions between political entities in two major events: the 2016 U.S. presidential election and COVID-19. The newly proposed problem, data, and method will facilitate future studies on interdisciplinary NLP methods and applications.
Blame games tend to follow major disruptions, be they financial crises, natural disasters or terrorist attacks. Studying how the blame game evolves and shapes the dominant crisis narratives is of great significance, as sense-making processes can affect regulatory outcomes, social hierarchies, and cultural norms. However, it takes tremendous time and effort for social scientists to manually examine each relevant news article and extract the blame ties (A blames B). In this study, we define a new task, Blame Tie Extraction, and construct a new dataset related to the United States financial crisis (2007-2010) from The New York Times, The Wall Street Journal and USA Today. We build a Bi-directional Long Short-Term Memory (BiLSTM) network over the contexts in which the entities appear, and it learns to automatically extract such blame ties at the document level. Leveraging large unsupervised models such as GloVe and ELMo, our best model achieves an F1 score of 70% on the test set for blame tie extraction, making it a useful tool for social scientists to extract blame ties more efficiently.
This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks: writing a public response and addressing an ethical dilemma. We found that, compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to an improvement relative to the neutral condition, albeit a smaller one, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shapes Cha
LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, which are prone to entangling behaviors, or single-aspect, which ignore cross-module error propagation. To address these limitations, we propose EvoTool, a self-evolving framework that optimizes a modular tool-use policy via a gradient-free evolutionary paradigm. EvoTool decomposes the agent's tool-use policy into four modules, namely Planner, Selector, Caller, and Synthesizer, and iteratively improves them in a self-improving loop through three novel mechanisms. Trajectory-Grounded Blame Attribution uses diagnostic traces to localize failures to a specific module. Feedback-Guided Targeted Mutation then edits only that module via natural-language critique. Diversity-Aware Population Selection preserves complementary candidates to ensure solution diversity. Across four benchmarks, EvoTool outperforms strong baselines by over 5 points on both GPT-4.1 and Qwen3-8B, while achieving superior efficiency a
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM-based agents further extend these abilities to various practical workflows. While recent progress shows that multi-agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates. To cope with the challenge, recent efforts in automated prompt design have reduced manual effort. However, multi-agent prompt optimization remains largely unexplored. Challenges like an exponentially expanding search space and ambiguous credit assignment together make systematic design intractable without principled methods. Therefore, we introduce Multi-Agent PRompt Optimization (MAPRO), a four-stage framework that first formulates MAS prompt optimization as a Maximum a Posteriori (MAP) inference problem and solves it using a language-guided variant of the max-product belief propagation algorithm. To address credit assignment and update the system iteratively, MAPRO employs a topology-aware refinement mechanism that integrates execution feedback and downstream blames to selectively
Identifying Bug-Inducing Commits (BICs) is fundamental for understanding software defects and enabling downstream tasks such as defect prediction and automated program repair. Yet existing SZZ-based approaches rely on git blame, restricting the search space to commits that directly modified the fixed lines. Our preliminary study on 2,102 validated bug-fixing commits reveals this limitation is significant: 28% of BICs require traversing commit history beyond blame results and 14% are blameless. We present AgenticSZZ, the first approach to apply Temporal Knowledge Graphs (TKGs) to software evolution analysis. AgenticSZZ reframes BIC identification from ranking blame commits into a graph search problem, where temporal ordering is fundamental to causal reasoning about bug introduction. The approach operates in two phases: (1) constructing a TKG that encodes commits with temporal and structural relationships, expanding the search space by traversing file history backward from blame commits and the bug-fixing commit; and (2) leveraging an LLM agent to navigate the graph using specialized tools for candidate exploration and causal analysis. Evaluation on three datasets shows that AgenticS
This paper studies the supply and effects of causal rhetoric in U.S. politics. We define causal rhetoric as assigning responsibility for political outcomes, via claims of blame and merit. Training a supervised classifier, we detect causal rhetoric in over a decade of congressional tweets, finding that its supply has risen rapidly and pervasively, displacing affective messaging. We show that the production of causal rhetoric involves a trade-off between revenues and costs. First, quasi-random variation in Twitter adoption shows that blame increases small-donor revenues by expanding donor count, while merit raises average donation size. Second, fine-grained legislative data suggest that policy ownership determines relative costs: blame is cheaper for opponents, merit for proposers. Finally, causal rhetoric has downstream effects on societal outcomes, fostering protest activity and shaping polarization and institutional trust.
Distribution shifts introduce uncertainty that undermines the robustness and generalization capabilities of machine learning models. While conventional wisdom suggests that learning causal-invariant representations enhances robustness to such shifts, recent empirical studies present two counterintuitive findings: (i) empirical risk minimization (ERM) can rival or even outperform state-of-the-art out-of-distribution (OOD) generalization methods, and (ii) its OOD generalization performance improves when all available covariates, not just causal ones, are utilized. Drawing on both empirical and theoretical evidence, we attribute this phenomenon to hidden confounding. Shifts in hidden confounding induce changes in data distributions that violate assumptions commonly made by existing OOD generalization approaches. Under such conditions, we prove that effective generalization requires learning environment-specific relationships, rather than relying solely on invariant ones. Furthermore, we show that models augmented with proxies for hidden confounders can mitigate the challenges posed by hidden confounding shifts. These findings offer new theoretical insights and practical guidance for desi
Explainable Artificial Intelligence (XAI) is critical for ensuring trust and accountability, yet its development remains predominantly visual. For blind and low-vision (BLV) users, the lack of accessible explanations creates a fundamental barrier to the independent use of AI-driven assistive technologies. This problem intensifies as AI systems shift from single-query tools into autonomous agents that take multi-step actions and make consequential decisions across extended task horizons, where a single undetected error can propagate irreversibly before any feedback is available. This paper investigates the unique XAI requirements of the BLV community through a comprehensive analysis of user interviews and contemporary research. By examining usage patterns across environmental perception and decision support, we identify a significant modality gap. Empirical evidence suggests that while BLV users highly value conversational explanations, they frequently experience "self-blame" for AI failures. The paper concludes with a research agenda for accessible Explainable AI in agentic systems, advocating for multimodal interfaces, blame-aware explanation design, and participatory development.
After major disasters, formal inquiries become arenas where responsibility is publicly contested. While extensive research has examined blame attribution through qualitative and actor-centred approaches, the relational structure of blame within formal accountability processes remains poorly understood. Using evidence from the Grenfell Tower Inquiry, this study analyses the web of blame presented during the Phase 2 closing proceedings, in which Counsel to the Inquiry synthesised how core participants publicly attributed responsibility to one another. We represent this synthesis as a directed network and examine its structural properties using standard tools from network analysis. The resulting configuration is interconnected, with pronounced reciprocity and local clustering, indicating that responsibility claims were articulated within a dense institutional environment rather than as isolated, one-to-one accusations. Comparisons with neutral benchmark models show that several observed features depart from expectations based on simple structural constraints alone, revealing patterned organisation in the public articulation of blame within the Inquiry. By applying network-analytic met
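The directed-network representation described in the abstract above can be illustrated with a minimal sketch. The actors and edges below are hypothetical placeholders, not data from the Inquiry, and the two measures shown (edge reciprocity and in-degree) are standard network statistics chosen for illustration, not necessarily the exact ones the authors computed.

```python
from collections import Counter

# Hypothetical blame edges (blamer, blamed); names are illustrative only.
edges = {
    ("A", "B"), ("B", "A"),  # A and B blame each other
    ("B", "C"), ("C", "B"),  # B and C blame each other
    ("A", "C"),              # A blames C, not reciprocated
}

def reciprocity(edge_set):
    """Fraction of directed edges whose reverse edge is also present."""
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

def in_degree(edge_set):
    """Number of distinct actors blaming each actor."""
    return Counter(blamed for (_, blamed) in edge_set)

print(reciprocity(edges))      # share of blame ties that are mutual
print(dict(in_degree(edges)))  # how many actors blame each actor
```

In this toy graph four of the five edges are reciprocated, so a high reciprocity value signals mutual accusation rather than isolated, one-to-one claims.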
This paper argues that conventional blame practices fall short of capturing the complexity of moral experiences, neglecting power dynamics and discriminatory social practices. It is evident that robots, embodying roles linked to specific social groups, pose a risk of reinforcing stereotypes of how these groups behave or should behave, so they set a normative and descriptive standard. In addition, we argue that faulty robots might create expectations of who is supposed to compensate and repair after their errors, where social groups that are already disadvantaged might be blamed disproportionately if they do not act according to their ascribed roles. This theoretical and empirical gap becomes even more urgent to address as there have been indications of potential carryover effects from Human-Robot Interactions (HRI) to Human-Human Interactions (HHI). We therefore urge roboticists and designers to stay in an ongoing conversation about how social traits are conceptualised and implemented in this technology. We also argue that one solution could be to 'embrace the glitch' and to focus on constructively disrupting practices instead of prioritizing efficiency and smoothness of interactio
The battery state of health (SOH) based on capacity fade and resistance increase is not sufficient for predicting remaining useful life (RUL). The electrochemical community blames the path-dependency of the battery degradation mechanisms for our inability to forecast the degradation. The control community knows that the path-dependency is addressed by full state estimation. We show that even the electrode-specific SOH (eSOH) estimation is not enough to fully define the degradation states by simulating infinitely many possible degradation trajectories and RULs from a unique eSOH. Finally, we define the deepSOH states that capture the individual contributions of all the common degradation mechanisms, namely SEI, plating, and mechanical fracture, to the loss of lithium inventory. We show that the addition of cell expansion measurement may allow us to estimate the deepSOH and predict the RUL.
A gradual type system allows developers to declare certain types to be enforced by the compiler (i.e., statically typed), while leaving other types to be enforced via runtime checks (i.e., dynamically typed). When runtime checks fail, debugging gradually typed programs becomes cumbersome, because these failures may arise far from the original point where an inconsistent type assumption is made. To ease this burden on developers, some gradually typed languages produce a blame report for a given type inconsistency. However, these reports are sometimes misleading, because they might point to program points that do not need to be changed to stop the error. To overcome the limitations of blame reports, we propose using dynamic program slicing as an alternative approach to help programmers debug run-time type errors. We describe a proof-of-concept for TypeSlicer, a tool that would present dynamic program slices to developers when a runtime check fails. We performed a Wizard-of-Oz user study to investigate how developers respond to dynamic program slices through a set of simulated interactions with TypeScript programs. This formative study shows that developers can understand and apply dy
We review four areas of theoretical computer science which share technical or philosophical ideas with the work of Belnap on his useful four-valued logic. Perhaps surprisingly, the inspiration from Belnap-Dunn logic is acknowledged only in the study of d-frames. The connections between Belnap's work and linear logic, Blame Calculus or the study of LVars are not openly admitted. Three of these connections with Belnap's work go via the twist-product representation of bilattices. On the one hand, it allows us to view a large class of models of linear logic as based on Belnap-Dunn logic. On the other hand, d-frames admit two twist-product representation theorems and, also, the key theorem of Blame Calculus is essentially a twist-product representation theorem too, albeit with a strong proof-theoretic flavour.
Researchers have raised awareness about the harms of aggregating labels especially in subjective tasks that naturally contain disagreements among human annotators. In this work we show that models that are only provided aggregated labels show low confidence on high-disagreement data instances. While previous studies consider such instances as mislabeled, we argue that the reason the high-disagreement text instances have been hard-to-learn is that the conventional aggregated models underperform in extracting useful signals from subjective tasks. Inspired by recent studies demonstrating the effectiveness of learning from raw annotations, we investigate classifying using Multiple Ground Truth (Multi-GT) approaches. Our experiments show an improvement of confidence for the high-disagreement instances.
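As a rough illustration of the contrast drawn in the abstract above, the sketch below compares a single aggregated label with the raw annotations it hides. The items, labels, and the majority-vote rule are assumptions made for illustration; the abstract does not specify the aggregation scheme or the Multi-GT model, so this is not the authors' method.

```python
from collections import Counter

# Hypothetical raw annotations: item id -> labels from three annotators.
annotations = {
    "post_1": ["offensive", "offensive", "offensive"],
    "post_2": ["offensive", "neutral", "neutral"],
}

def majority_label(labels):
    """Aggregate by majority vote; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def disagreement(labels):
    """Share of annotators who did not choose the majority label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1 - top_count / len(labels)

for item, labels in annotations.items():
    # An aggregated model sees only majority_label; a multi-ground-truth
    # setup would keep every entry of `labels` as a separate target.
    print(item, majority_label(labels), round(disagreement(labels), 2))
```

Both items receive one label after aggregation, but only the raw annotations reveal that the second item is contested.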
We develop a feedback theory that includes reinforcing and balancing feedback effects that emerge when colleges compete for reputation, applicants, and tuition revenue. The feedback theory is replicated in a formal duopoly model consisting of two competing colleges. An independent ranking entity determines the relative order of the colleges. College applicants choose between the two colleges based on the rankings and the financial aid offered by the colleges. Contrary to the conventional wisdom that competition lowers prices and benefits consumers, our simulations show that competition between academic institutions for resources and reputation leads to tuition escalation that negatively affects students and their families. Four of the five scenarios -- rankings, a capital campaign, facilities improvements, and an excellence campaign -- increase college tuition, institutional debt, and expenditures per student; only the scenario of ignoring the rankings decreases these measures. By referring to the feedback structure of academic competition, the article makes several recommendations for controlling tuition inflation. This article contributes to the literature on the economics of hig
Everyday decisions often involve many different levels. What connects the higher and lower levels of this decision hierarchy to one another determines how the cause(s) of failures are interpreted. It is hypothesized that decision confidence guides the assignment of blame to the correct level of the hierarchy, but this hypothesis has only been tested by manipulating sensory evidence itself. We examined the consequences of modulating subjective confidence in hierarchical decision making via extra-sensory, social influence. Participants who made hierarchical, motion-plus-bandit decisions also received social information from a partner who advised the participant in the motion task. The strength of social advice -- independently of sensory signals -- modulated the likelihood of strategy change after negative feedback. Our findings therefore provide strong empirical evidence that subjective confidence per se acts as the bridge in the assignment of credit and blame to various levels of the decision hierarchy.
Scholars have not asked why so many governments created ad hoc scientific advisory bodies (ahSABs) to address the Covid-19 pandemic instead of relying on existing public health infrastructure. We address this neglected question with an exploratory study of the US, UK, Sweden, Italy, Poland, and Uganda. Drawing on our case studies and the blame-avoidance literature, we find that ahSABs are created to excuse unpopular policies and take the blame should things go wrong. Thus, membership typically represents a narrow range of perspectives. An ahSAB is a good scapegoat because it does little to reduce government discretion and has limited ability to deflect blame back to government. Our explanation of our deviant case of Sweden, which did not create an ahSAB, reinforces our general principles. We draw the policy inference that ahSAB membership should be vetted by the legislature to ensure broad membership.
A question we can ask of multi-agent systems is whether the agents' collective interaction satisfies particular goals or specifications, which can be either individual or collective. When a collaborative goal is not reached, or a specification is violated, a pertinent question is whether any agent is to blame. This paper considers a two-agent synchronous setting and a formal language to specify when agents' collaboration is required. We take a deontic approach and use obligations, permissions, and prohibitions to capture notions of non-interference between agents. We also handle reparations, allowing violations to be corrected or compensated. We give trace semantics to our logic, and use it to define blame assignment for violations. We give an automaton construction for the logic, which we use as the base for model checking and blame analysis. We also further provide quantitative semantics that is able to compare different interactions in terms of the required reparations.
Cognitive and psychological studies on morality have proposed underlying linguistic and semantic factors. However, laboratory experiments in the philosophical literature often lack the nuances and complexity of real life. This paper examines how well the findings of these cognitive studies generalize to a corpus of over 30,000 narratives of tense social situations submitted to a popular social media forum. These narratives describe interpersonal moral situations or misgivings; other users judge from the post whether the author (protagonist) or the opposing side (antagonist) is morally culpable. Whereas previous work focuses on predicting the polarity of normative behaviors, we extend and apply natural language processing (NLP) techniques to understand the effects of descriptions of the people involved in these posts. We conduct extensive experiments to investigate the effect sizes of features to understand how they affect the assignment of blame on social media. Our findings show that aggregating psychology theories enables understanding real-life moral situations. Moreover, our results suggest that there exist biases in blame assignment on social media, such as males are more like