Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach: we aim to show the user the "tells" by which the model believes a text is AI- or human-written, empowering the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further e
How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, "Show, Don't Tell," we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our "Show, Don't Tell" paradigm of automatic dataset creation and novel object-detection on a real-world robot. Empirical resu
Can humans tell whether a news article was written by a person or a large language model (LLM)? We investigate this question using JudgeGPT, a study platform that independently measures source attribution (human vs. machine) and authenticity judgment (legitimate vs. fake) on continuous scales. From 2,318 judgments collected from 1,054 participants across content generated by six LLMs, we report five findings: (1) participants cannot reliably distinguish machine-generated from human-written text (p > .05, Welch's t-test); (2) this inability holds across all tested models, including open-weight models with as few as 7B parameters; (3) self-reported domain expertise predicts judgment accuracy (r = .35, p < .001) whereas political orientation does not (r = -.10, n.s.); (4) clustering reveals distinct response strategies ("Skeptics" vs. "Believers"); and (5) accuracy degrades after approximately 30 sequential evaluations due to cognitive fatigue. The answer, in short, is no: humans cannot reliably tell. These results indicate that user-side detection is not a viable defense and motivate system-level countermeasures such as cryptographic content provenance.
Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to "show, don't tell." We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives' surface forms into higher-level concepts and themes. By running LDA on LMs' retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method's outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.
Recently, the exciting new Fermilab (FNAL) Muon g-2 measurement impressively confirmed the final Brookhaven (BNL) result from 2004 and, with a result four times more precise, has launched a new serious attack on the Standard Model (SM). On the theoretical side, ab initio lattice QCD (LQCD) calculations of hadronic vacuum polarization have made remarkable progress. They are now the new standard for studying the leading non-perturbative contributions, which previously hindered matching the precision required for full exploitation of the experimental results. The lattice results affected both leading hadronic contributions, the hadronic vacuum polarization (HVP) and the hadronic light-by-light (HLbL) contributions, by increasing the previously generally accepted dispersion-relation results based on $e^+e^-$ to hadrons. The shifts reduced the discrepancy between theory and experiment, leaving nothing missing. One of the most prominent signs of Beyond the Standard Model (BSM) physics has disappeared: the SM appears validated more than ever, in agreement with what other searches at the Large Hadron Collider (LHC) at CERN tell us! A triumph of the SM, even though the SM cannot explai
The rise of synthetic media has blurred the boundary between reality and fabrication under the evolving power of artificial intelligence, fueling an infodemic that erodes public trust in cyberspace. For digital imagery, a multitude of editing applications further complicates the forensic analysis, including semantic edits that alter content, photometric adjustments that recalibrate colour characteristics, and geometric projections that reshape viewpoints. Collectively, these transformations manipulate and control perceptual interpretation of digital imagery. This susceptibility calls for forensic enquiry into reconstructing the chain of events, thereby revealing deeper evidential insight into the presence or absence of criminal intent. This study seeks to address an inverse problem of tracing the underlying generation chain that gives rise to the observed synthetic media. A tell-tale watermarking system is developed for explanatory reasoning over the nature and extent of transformations across the lifecycle of synthetic media. Tell-tale watermarks are tailored to different classes of transformations, responding in a manner that is neither strictly robust nor fragile but instead int
Cosine similarity is often used to measure the similarity of vector representations of neural network models. However, the cosine similarity of representations is not guaranteed to tell us anything about model probabilities. In this paper we show that for a softmax classifier, be it an image classifier or an autoregressive language model, the cosine similarity between label representations (called unembeddings in the paper) does not give any information on the probabilities assigned by the model. Specifically, we prove that given two unembeddings, it is possible to create another model which assigns the same probabilities for all inputs, but where the cosine similarity between the representations is now either 1 or -1. We also show that for a sigmoid classifier (where each input can be assigned multiple labels), all pairwise cosine similarities between the unembeddings define the set of possible label combinations. However, for softmax classifiers (where each input is assigned a ranking of the labels from most to least likely), we need all pairwise cosine similarities between all differences of unembeddings to know which rankings the model can predict. We conclude that it is mislea
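A small numerical illustration of the underlying idea, offered as a sketch rather than as the paper's construction: softmax probabilities are unchanged when the same vector is added to every unembedding, yet the cosine similarity between two unembeddings can move a long way. The toy unembeddings, the `shift` vector, and the hidden states below are hypothetical values chosen for the example; this shows only shift invariance, not the stronger result that the similarity can be driven to exactly 1 or -1.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probs(unembeddings, h):
    # Logit of each label is the dot product of its unembedding with h.
    return softmax([dot(u, h) for u in unembeddings])

# Hypothetical toy model: three labels, 2-d hidden states.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
shift = [5.0, 5.0]  # arbitrary vector added to every unembedding
U_shifted = [[a + s for a, s in zip(u, shift)] for u in U]

hidden_states = [[0.3, -1.2], [2.0, 0.5], [-0.7, 0.1]]

# Largest change in any label probability across the test inputs.
max_prob_gap = max(
    abs(p - q)
    for h in hidden_states
    for p, q in zip(probs(U, h), probs(U_shifted, h))
)

cos_before = cosine(U[0], U[1])                  # orthogonal: 0
cos_after = cosine(U_shifted[0], U_shifted[1])   # 60/61, close to 1

print(max_prob_gap, cos_before, cos_after)
```

The shift adds the same amount `dot(shift, h)` to every logit, which softmax cancels, so the two models agree on every input while their label representations look very different under cosine similarity.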
Automated traffic continued to surpass human-generated traffic on the web, and a rising proportion of this automation was explicitly malicious. Evasive bots could pretend to be real users, even solving CAPTCHAs and mimicking human interaction patterns. This work explores a less intrusive, protocol-level method: using TLS fingerprinting with the JA4 technique to tell bots apart from real users. Two gradient-boosted machine learning classifiers (XGBoost and CatBoost) were trained and evaluated on a dataset of real TLS fingerprints (JA4DB) after feature extraction, which derived informative signals from JA4 fingerprints that describe TLS handshake parameters. The CatBoost model performed better, achieving an AUC of 0.998 and an F1 score of 0.9734. Its accuracy on the test set was 0.9863. The XGBoost model showed very similar results. Feature significance analyses identified JA4 components, especially ja4_b, cipher_count, and ext_count, as the most influential on model effectiveness. Future research will extend this method to new protocols, such as HTTP/3, and add additional device-fingerprinting features to test how well the system resists advanced bot evasion tactics.
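The abstract does not spell out its feature extraction, so the following is only a sketch of how features such as ja4_b, cipher_count, and ext_count could be read off a JA4 string. It assumes the published JA4 layout `a_b_c`, where the first section packs protocol, TLS version, SNI flag, cipher count, extension count, and ALPN into ten characters. The function name `ja4_features`, the output keys other than the three named in the abstract, and the sample fingerprint are illustrative assumptions.

```python
def ja4_features(fingerprint: str) -> dict:
    """Split a JA4 TLS fingerprint into simple features (assumed layout)."""
    ja4_a, ja4_b, ja4_c = fingerprint.split("_")
    if len(ja4_a) != 10:
        raise ValueError(f"unexpected JA4_a section: {ja4_a!r}")
    return {
        "protocol": ja4_a[0],              # 't' for TCP, 'q' for QUIC
        "tls_version": ja4_a[1:3],         # e.g. '13' for TLS 1.3
        "sni": ja4_a[3],                   # 'd' if SNI is a domain, 'i' otherwise
        "cipher_count": int(ja4_a[4:6]),   # number of offered cipher suites
        "ext_count": int(ja4_a[6:8]),      # number of extensions
        "alpn": ja4_a[8:10],               # first and last char of first ALPN value
        "ja4_b": ja4_b,                    # truncated hash of the sorted cipher list
        "ja4_c": ja4_c,                    # truncated hash of extensions and signature algorithms
    }

# Illustrative fingerprint, not taken from the paper's dataset.
features = ja4_features("t13d1516h2_8daaf6152771_e5627efa2ab1")
print(features["cipher_count"], features["ext_count"], features["ja4_b"])
```

Hash sections such as `ja4_b` are categorical, so a classifier would need them encoded (CatBoost accepts categorical columns directly), whereas the counts can be used as plain integers.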
The availability of extended reality (XR) devices has widened their adoption, yet authoring interactive experiences remains complex for non-programmers. We introduce Tell-XR, an intelligent agent leveraging large language models (LLMs) to guide end-users in defining the interaction in XR settings using automations described as Event-Condition-Action (ECA) rules. Through a formative study, we identified the key conversation stages to define and refine automations, which informed the design of the system architecture. The evaluation study in two scenarios (a VR museum and an AR smart home) demonstrates the effectiveness of Tell-XR across different XR interaction settings.
Continual learning (CL) models are designed to learn new tasks arriving sequentially without re-training the network. However, real-world ML applications have very limited label information, and these models suffer from catastrophic forgetting. To address these issues, we propose an unsupervised CL model with task experts, called Unsupervised Task Expert Lifelong Learning (U-TELL), to continually learn from data arriving in a sequence while addressing catastrophic forgetting. During training of U-TELL, we introduce a new expert on arrival of a new task. Our proposed architecture has task experts, a structured data generator, and a task assigner. Each task expert is composed of three blocks: (i) a variational autoencoder to capture the task distribution and perform data abstraction, (ii) a k-means clustering module, and (iii) a structure extractor to preserve the latent task data signature. During testing, the task assigner selects a suitable expert to perform clustering. U-TELL does not store or replay task samples; instead, we use generated structured samples to train the task assigner. We compared U-TELL with five SOTA unsupervised CL methods. U-TELL outperformed all baselines on seven benchmarks and one
Luminous Red Novae (LRNe) have been argued to be related to the ejection of common envelopes (CEs) in binary star systems. Ejection of CEs leads to tightened stellar orbits capable of forming compact binaries that merge in Hubble time. As these mergers are seen by gravitational-wave (GW) detectors such as LIGO, Virgo and KAGRA (LVK), we ask what the merger rates of compact binaries in LVK tell us about the fraction of LRNe that lead to the formation of compact binaries that merge in Hubble time. Using the observed volumetric rates of LRNe from the Zwicky Transient Facility (ZTF) and of compact binary mergers from LVK observations, we derive limits on the fraction of LRNe that produce compact binaries that merge in Hubble time. Assuming the LRNe rate closely follows the star formation rate at any redshift, we use the delay time distribution models for compact binaries to compute the compact binary merger rate. A comparison of this merger rate with the latest volumetric rates of compact binary mergers from the fourth GW transient catalog (GWTC-4) at the present epoch of LVK allows us to constrain the above fraction. We find that only a fraction as small as $\sim 10^{-3}$ (median) of
Multimodal Large Language Models (MLLMs), which can answer complex questions about an image, struggle to tell the time on analog clocks. This is probably due to the lack of images with clocks at different times in their training set. In this work we explore this issue with one of the latest MLLMs, GPT-4.1, to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show how models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? In this work we put the models to the test with different clocks to illustrate the limitations of MLLMs in abstracting and generalizing.
The opportunity to tell a white lie (i.e., a lie that benefits another person) generates a moral conflict between two opposite moral dictates, one pushing towards always telling the truth and the other pushing towards helping others. Here we study how people resolve this moral conflict. What does telling a white lie signal about a person's pro-social tendencies? To answer this question, we conducted a two-stage 2x2 experiment. In the first stage, we used a Deception Game to measure aversion to telling a Pareto white lie (i.e., a lie that helps both the liar and the listener), and aversion to telling an altruistic white lie (i.e., a lie that helps the listener at the expense of the liar). In the second stage, we measured altruistic tendencies using a Dictator Game and cooperative tendencies using a Prisoner's Dilemma. We found three major results: (i) both altruism and cooperation are positively correlated with aversion to telling a Pareto white lie; (ii) both altruism and cooperation are negatively correlated with aversion to telling an altruistic white lie; (iii) men are more likely than women to tell an altruistic white lie, but not to tell a Pareto white lie. Our results shed lig
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
Recent literature highlights the advantages of implementing social rules via dynamic game forms. We characterize when truth-telling remains a dominant strategy in gradual mechanisms implementing strategy-proof social rules, where agents gradually reveal their private information while acquiring information about others in the process. Our first characterization hinges on the incentive-preservation of a basic transformation on gradual mechanisms called illuminating that partitions information sets. The second relies on a single reaction-proofness condition. We demonstrate the usefulness of both characterizations through applications to second-price auctions and the top-trading cycles algorithm.
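As background for the second-price auction application, the static (sealed-bid) version of the dominance property can be checked by brute force. This is a sketch of that textbook case only, not of the gradual mechanisms or the characterizations in the abstract. The function names, the integer bid grid, the two-rival setup, and the tie-breaking rule (ties go against the bidder under study) are all assumptions made for the example.

```python
from itertools import product

def second_price_payoff(value, bid, rival_bids):
    """Utility of the bidder under study in a sealed-bid second-price auction.

    Assumption: ties are broken against this bidder.
    """
    highest_rival = max(rival_bids)
    if bid > highest_rival:
        return value - highest_rival  # winner pays the highest rival bid
    return 0

def truthful_is_dominant(grid, n_rivals=2):
    """Check that bidding one's value is never beaten by any deviation."""
    for value in grid:
        for rival_bids in product(grid, repeat=n_rivals):
            truthful = second_price_payoff(value, value, rival_bids)
            for deviation in grid:
                if second_price_payoff(value, deviation, rival_bids) > truthful:
                    return False
    return True

result = truthful_is_dominant(range(0, 11))
print(result)
```

The check enumerates every value, every rival bid profile, and every alternative bid on the grid, so it verifies weak dominance on that grid rather than proving it in general.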
Prompt engineering has shown remarkable success with large language models, yet its systematic exploration in computer vision remains limited. In semantic segmentation, both textual and visual prompts offer distinct advantages: textual prompts through open-vocabulary methods allow segmentation of arbitrary categories, while visual reference prompts provide intuitive reference examples. However, existing benchmarks evaluate these modalities in isolation, without direct comparison under identical conditions. We present Show or Tell (SoT), a novel benchmark specifically designed to evaluate both visual and textual prompts for semantic segmentation across 14 datasets spanning 7 diverse domains (common scenes, urban, food, waste, parts, tools, and land-cover). We evaluate 5 open-vocabulary methods and 4 visual reference prompt approaches, adapting the latter to handle multi-class segmentation through a confidence-based mask merging strategy. Our extensive experiments reveal that open-vocabulary methods excel with common concepts easily described by text but struggle with complex domains like tools, while visual reference prompt methods achieve good average results but exhibit high varia
Denoising diffusion models have demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm: using diffusion models to provide suggestions to autoregressive generation rather than to replace it. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points, respectively. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available a
In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which often introduces significant latency, computational overhead, and errors. Current advanced image-to-text approaches, which bypass OCR, typically yield plain text outputs without corresponding vision grounding. In this paper, we introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. Distinctively, STNet utilizes a unique <see> token to observe pertinent image areas, aided by a decoder that interprets physical coordinates linked to this token. Positioned at the outset of the answer text, the <see> token allows the model to first see (observing the regions of the image related to the input question) and then tell (providing articulated textual responses). To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets. Leveraging the advanced text processing prowess of GPT-4, we develop the TVG (TableQA with Vision Groundi
Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs' long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information retrieval and utilization, a "know but don't tell" phenomenon. We further analyze the relationship between extraction time and final accuracy, offering insights into the underlying mechanics of transformer models.
The gauge group of strong and electroweak interactions in Nature could be any of the four that share the same Lie algebra, $SU(3)_c\times SU(2)_L\times U(1)_Y/Z_p\equiv G_p$ with $Z_p=\left\{Z_6,Z_3,Z_2,Z_1\right\}$. Each of these cases allows in its spectrum for the matter fields of the SM but also for new distinctive representations, e.g. under the assumption that $q_L$ possesses the minimum possible hypercharge in Nature, $G_p$ allows for particles with a multiple of $p\,e/6$ for electric charge. This letter discusses how these new possibilities in the spectrum could be used to tell the SM group apart.