Current LLM assistants are powerful at answering questions, but they have limited access to the behavioral context that reveals when and where a user is struggling. We present a gaze-grounded multimodal LLM assistant that uses egocentric video with gaze overlays to identify likely points of difficulty and target follow-up retrospective assistance. We instantiate this vision in a controlled study (n=36) comparing the gaze-aware AI assistant to a text-only LLM assistant. Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more accurate and personalized in its assessments of users' reading behavior and significantly improved people's ability to recall information. Users spoke significantly fewer words with the gaze-aware assistant, indicating more efficient interactions. Qualitative results underscored both perceived benefits in comprehension and challenges when interpretations of gaze behaviors were inaccurate. Our findings suggest that gaze-aware LLM assistants can reason about cognitive needs to improve cognitive outcomes of users.
Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions. We construct a dataset containing conversations where the assistant guides the user in performing the task. On observing that an off-the-shelf language model is a very talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA finetuning method which improves the model's ability to suppress less informative dialogues, while maintaining its tendency to communicate important instructions. This leads to >30% improvement in the F-score. Finetuning the model also results in a 16x speedup by eliminating the need to provide in-context examples in the prompt. We further describe how such an assistant is implemented on edge de
As large language models (LLMs) become increasingly integrated into daily life, there is growing demand for AI assistants that are not only reactive but also proactive and personalized. While recent advances have pushed forward proactivity and personalization individually, their combination remains underexplored. To bridge this gap, we introduce ProPerSim, a new task and simulation framework for developing assistants capable of making timely, personalized recommendations in realistic home scenarios. In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context. The assistant's goal is to use these ratings to learn and adapt to achieve higher scores over time. Built on ProPerSim, we propose ProPerAssistant, a retrieval-augmented, preference-aligned assistant that continually learns and adapts through user feedback. Experiments across 32 diverse personas show that ProPerAssistant adapts its strategy and steadily improves user satisfaction, highlighting the promise of uniting proactivity and personalization.
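The rating-driven adaptation loop described above can be illustrated with a deliberately simple stand-in. The sketch below is not ProPerAssistant's retrieval-augmented, preference-aligned method; it swaps in an epsilon-greedy bandit over three made-up actions, and `RatingBandit`, `toy_user_rating`, `run`, and the rating values are all hypothetical names and numbers chosen for illustration.

```python
import random
from collections import defaultdict


class RatingBandit:
    """Epsilon-greedy learner over a fixed set of candidate actions (a stand-in, not the paper's method)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.total = defaultdict(float)  # sum of ratings received per action
        self.count = defaultdict(int)    # number of times each action was tried

    def mean(self, action):
        # Average rating so far; 0.0 for actions that were never tried.
        return self.total[action] / self.count[action] if self.count[action] else 0.0

    def suggest(self):
        # Try every action once, then mostly exploit the best-rated one.
        untried = [a for a in self.actions if self.count[a] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=self.mean)

    def update(self, action, rating):
        self.total[action] += rating
        self.count[action] += 1


def toy_user_rating(action):
    # Hypothetical persona: prefers tea, tolerates silence, dislikes coffee suggestions.
    return {"suggest_tea": 5.0, "suggest_coffee": 2.0, "stay_silent": 3.0}[action]


def run(steps=50):
    bandit = RatingBandit(["suggest_tea", "suggest_coffee", "stay_silent"])
    ratings = []
    for _ in range(steps):
        action = bandit.suggest()
        rating = toy_user_rating(action)
        bandit.update(action, rating)
        ratings.append(rating)
    return bandit, ratings
```

Including `stay_silent` as an action is one way to model the timing side of proactivity: the learner can discover that not interrupting sometimes rates higher than a poorly matched suggestion.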
Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches that implicitly or explicitly incorporate an assistant distribution have recently been proposed. However, past proposals of assistant distributions have been fragmented, lacking a systematic investigation of the interpolation path and the divergence. This paper proposes the $α$-mixture assistant distribution, a novel generalized family of assistant distributions, and $α$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $α$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $α$,
The purpose of this paper is to enhance the performance of the virtual assistant. So, what exactly is a virtual assistant? A virtual assistant, also known as an AI assistant or digital assistant, is application software that understands natural language voice commands and can perform tasks on the user's behalf. What does a virtual assistant do? Virtual assistants can complete practically any smartphone or PC activity that users can complete on their own, and the list is continually expanding. They typically handle an impressive variety of tasks, including scheduling meetings, delivering messages, and monitoring the weather. Previous virtual assistants, like Google Assistant and Cortana, were limited in that they could only perform searches and were not entirely automated. For instance, they cannot fast-forward or rewind a song to control playback; they can only search for songs and play them. Currently, we are working on a project in which we automate Google, YouTube, and many other things to improve the assistant's functionality. Now, in order to simplify the pro
Recently, AI assistants based on large language models (LLMs) have shown surprising performance in many tasks, such as dialogue, solving math problems, writing code, and using tools. Although LLMs possess extensive world knowledge, they still make factual errors when facing some knowledge-intensive tasks, such as open-domain question answering. These untruthful responses from the AI assistant may cause significant risks in practical applications. We believe that an AI assistant's refusal to answer questions it does not know is a crucial method for reducing hallucinations and making the assistant truthful. Therefore, in this paper, we ask the question "Can AI assistants know what they don't know and express them through natural language?" To answer this question, we construct a model-specific "I don't know" (Idk) dataset for an assistant, which contains its known and unknown questions, based on existing open-domain question answering datasets. Then we align the assistant with its corresponding Idk dataset and observe whether it can refuse to answer its unknown questions after alignment. Experimental results show that after alignment with Idk datasets, the assistant can refuse to answer most
While current chat-based AI assistants primarily operate reactively, responding only when prompted by users, there is significant potential for these systems to proactively assist in tasks without explicit invocation, enabling a mixed-initiative interaction. This work explores the design and implementation of proactive AI assistants powered by large language models. We first outline the key design considerations for building effective proactive assistants. As a case study, we propose a proactive chat-based programming assistant that automatically provides suggestions and facilitates their integration into the programmer's code. The programming context provides a shared workspace enabling the assistant to offer more relevant suggestions. We conducted a randomized experimental study examining the impact of various design elements of the proactive assistant on programmer productivity and user experience. Our findings reveal significant benefits of incorporating proactive chat assistants into coding environments and uncover important nuances that influence their usage and effectiveness.
We investigate the role of energy-invariant assistants in energy extraction from quantum batteries. To this end, for energy extraction, we restrict ourselves to unitaries that jointly act on the battery and the assistant but preserve the energy of the assistant. We demonstrate that, in the presence of an energy-invariant assistant having the same dimension as the battery, all stored energy of the battery can always be extracted, transforming the battery into its ground state when an appropriate joint unitary and assistant state are employed. Additionally, we provide a necessary and sufficient condition for a battery to be unable to provide any energy, i.e., to be inactive, even when an energy-invariant assistant is present and prepared in an arbitrary but fixed state.
Commercial voice assistants are largely feminized and associated with stereotypically feminine traits such as warmth and submissiveness. As these assistants continue to be adopted for everyday uses, it is imperative to understand how the portrayed gender shapes the voice assistant's ability to mitigate errors, which are still common in voice interactions. We report a study (N=40) that examined the effects of voice gender (feminine, ambiguous, masculine), error mitigation strategies (apology, compensation), and participants' gender on people's interaction behavior and perceptions of the assistant. Our results show that AI assistants that apologized appeared warmer than those that offered compensation. Moreover, male participants preferred apologetic feminine assistants over apologetic masculine ones. Furthermore, when errors occurred, male participants interrupted AI assistants more frequently than female participants did, regardless of the assistant's perceived gender. Our results suggest that the perceived gender of a voice assistant biases user behavior, especially for male users, and that an ambiguous voice has the potential to reduce biases associated with gender-specific traits.
Recent advancements in LLMs have contributed to the rise of advanced conversational assistants that can assist with user needs through natural language conversation. This paper presents ScheduleMe, a multi-agent calendar assistant that lets users manage Google Calendar events in natural language. The system uses a graph-structured coordination mechanism in which a central supervisory agent supervises specialized task agents, allowing modularity, conflict resolution, and context-aware interactions to resolve ambiguities and evaluate user commands. This approach illustrates how structured reasoning and agent cooperation can increase the usability and flexibility of personal calendar assistant tools.
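A supervisor that routes commands to specialized task agents, and falls back to a clarifying question when the intent is ambiguous, might look roughly like the sketch below. This is an assumption-laden illustration and not ScheduleMe's implementation: keyword matching stands in for LLM-based intent detection, no calendar API is called, and `KEYWORDS`, `AGENTS`, `supervisor`, and the agent functions are hypothetical names.

```python
from typing import Callable, Dict, Tuple


# Hypothetical task agents; a real system would wrap an LLM call and a calendar API here.
def create_event(cmd: str) -> str:
    return f"created: {cmd}"


def delete_event(cmd: str) -> str:
    return f"deleted: {cmd}"


def ask_clarification(cmd: str) -> str:
    return f"clarify: which action do you want for '{cmd}'?"


# Keyword lists are a stand-in for learned intent classification.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "create": ("schedule", "add", "book"),
    "delete": ("cancel", "delete", "remove"),
}

AGENTS: Dict[str, Callable[[str], str]] = {
    "create": create_event,
    "delete": delete_event,
}


def supervisor(cmd: str) -> str:
    """Route a command to exactly one task agent, or ask the user when intent is ambiguous."""
    lowered = cmd.lower()
    matches = [
        name for name, words in KEYWORDS.items()
        if any(word in lowered for word in words)
    ]
    if len(matches) != 1:
        # Zero or several candidate agents: resolve the ambiguity with the user.
        return ask_clarification(cmd)
    return AGENTS[matches[0]](cmd)
```

Keeping the routing table (`AGENTS`) separate from the agents themselves is what gives the modularity the abstract mentions: a new task agent is added by registering one more entry, without touching the supervisor logic.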
We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPIE to enhance live conversations.
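The two-model pipeline above, a small model gating a larger one, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TwoStageAssistant`, `toy_gate`, `toy_generator`, and the 0.5 threshold are hypothetical, and the toy functions replace real models so the example runs without weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TwoStageAssistant:
    """Gate every turn with a cheap model; call the expensive model only when the gate fires."""
    should_respond: Callable[[List[str]], float]  # hypothetical small model, returns a score in [0, 1]
    generate: Callable[[List[str]], str]          # hypothetical large model, returns a short hint
    threshold: float = 0.5                        # assumed decision threshold

    def step(self, history: List[str]) -> Optional[str]:
        # The small model runs on every turn; staying silent is the default.
        if self.should_respond(history) < self.threshold:
            return None
        return self.generate(history)


# Toy stand-ins for the two models.
def toy_gate(history: List[str]) -> float:
    # Fire only when the latest utterance is a question.
    return 1.0 if history and history[-1].rstrip().endswith("?") else 0.0


def toy_generator(history: List[str]) -> str:
    return "hint: " + history[-1][:20]
```

The design choice this illustrates is cost asymmetry: because most turns need no response, the per-turn cost is dominated by the small gate, and the larger generator runs only on the few turns where assistance is judged useful.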
Surgical intervention is crucial to patient healthcare, and many studies have developed advanced algorithms to provide understanding and decision-making assistance for surgeons. Despite great progress, these algorithms are each developed for a single specific task and scenario, and in practice require the manual combination of different functions, which limits their applicability. An intelligent and versatile surgical assistant is therefore expected to accurately understand the surgeon's intentions and accordingly conduct the specific tasks to support the surgical process. In this work, by leveraging advanced multimodal large language models (MLLMs), we propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention and complete a series of surgical understanding tasks, e.g., surgical scene analysis, surgical instrument detection, and segmentation on demand. Specifically, to achieve superior surgical multimodal understanding, we devise a mixture of projectors (MOP) module to align the surgical MLLM in VS-Assistant to balance the natural and surgical knowledge. Moreover, we devise a surgical Function-Calling Tuning strategy to enable the VS-Assi
This article investigates the deployment of a Voice User Interface (VUI)-powered digital assistant in a retail setting and assesses its impact on customer engagement and service efficiency. The study explores how digital assistants can enhance user interactions through advanced conversational capabilities with multilingual support. By integrating a digital assistant into a high-traffic retail environment, we evaluate its effectiveness in improving the quality of customer service and operational efficiency. Data collected during the experiment demonstrate varied impacts on customer interaction, revealing insights into the future optimizations of digital assistant technologies in customer-facing roles. This study contributes to the understanding of digital transformation strategies within the customer relations domain emphasizing the need for service flexibility and user-centric design in modern retail stores.
Demographic growth and the rise in the average age of the population are increasing the demand for elderly assistance. Health care oriented ambient intelligence technologies are fundamental to supporting elderly people's autonomy. In this paper, we present a smart home system that is able to recognize human activities and is integrated with a proactive vocal assistant. We chose one of the possible user scenarios to show the performance of this smart home system and to perform a preliminary comparison of users' experience while watching videos of a volunteer interacting with an embodied versus a non-embodied assistant. The scenario is recorded from the user's point of view, while the user interacts with a robot assistant or a simple vocal assistant. The results of the User Experience Questionnaire show that participants found the robot assistant considerably more attractive, innovative, and stimulating than the vocal assistant.
The development of domain-specific languages (DSLs) is a laborious and iterative process that seems to naturally lean to the use of generative artificial intelligence. We design and prototype DSL Assistant, a tool that integrates generative language models to support the development of DSLs. DSL Assistant uses OpenAI's assistant API with GPT-4o to generate DSL grammars and example instances. To reflect real-world use, DSL Assistant supports several different interaction modes for evolving a DSL design, and includes automatic error repair. Our experiments show that DSL Assistant helps users to create and modify DSLs. However, the quality of the generated DSLs depends on the specific domain and the followed interaction patterns.
In the realm of online privacy, privacy assistants play a pivotal role in empowering users to manage their privacy effectively. Although recent studies have shown promising progress in tackling tasks such as privacy violation detection and personalized privacy recommendations, a crucial aspect for widespread user adoption is the capability of these systems to provide explanations for their decision-making processes. This paper presents a privacy assistant for generating explanations for privacy decisions. The privacy assistant focuses on discovering latent topics, identifying explanation categories, establishing explanation schemes, and generating automated explanations. Users can draw on the generated explanations to understand the recommendations of the privacy assistant. Our user study on a real-world privacy dataset of images shows that users find the generated explanations useful and easy to understand. Additionally, the generated explanations can be used by privacy assistants themselves to improve their decision-making. We show how this can be realized by incorporating the generated explanations into a state-of-the-art privacy assistant.
Despite huge gains in natural language understanding performance from large language models in recent years, voice assistants still often fail to meet user expectations. In this study, we conducted a mixed-methods analysis of how voice assistant failures affect users' trust in their voice assistants. To illustrate how users have experienced these failures, we contribute a crowdsourced dataset of 199 voice assistant failures, categorized across 12 failure sources. Relying on interview and survey data, we find that certain failures, such as those due to overcapturing users' input, derail user trust more than others. We additionally examine how failures impact users' willingness to rely on voice assistants for future tasks. Users often stop using their voice assistants for specific tasks that result in failures for a short period of time before resuming similar usage. We demonstrate the importance of low-stakes tasks, such as playing music, for building trust after failures.
Thanks to the powerful language comprehension capabilities of Large Language Models (LLMs), existing instruction-based image editing methods have introduced Multimodal Large Language Models (MLLMs) to promote information exchange between instructions and images, ensuring the controllability and flexibility of image editing. However, these frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks, which is not only time-consuming and labor-intensive but also fails to achieve satisfactory results. In this paper, we present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction. We instruct the open-source LLM with a specially designed prompt template to analyze user needs after receiving instructions and hierarchically invoke existing advanced editing methods, all without additional training. Moreover, we implement a plug-and-play and efficient invocation of image editing methods, allowing complex and unseen editing tasks to be integrated into the current framework, achieving stable and high-quality editing results. Extensive experiments demonstrate that o
The development of conversational AI assistants is an iterative process with multiple components. As such, the evaluation and continual improvement of these assistants is a complex and multifaceted problem. This paper introduces the challenges in evaluating and improving a generative AI assistant for enterprises, which is under active development, and how we address these challenges. We also share preliminary results and discuss lessons learned.
Virtual Assistant technology is rapidly proliferating to improve productivity in a variety of tasks. While several virtual assistants for everyday tasks are well-known (e.g., Siri, Cortana, Alexa), assistants for specialty tasks such as software engineering are rarer. One key reason software engineering assistants are rare is that very few experimental datasets are available and suitable for training the AI that is the bedrock of current virtual assistants. In this paper, we present a set of Wizard of Oz experiments that we designed to build a dataset for creating a virtual assistant. Our target is a hypothetical virtual assistant for helping programmers use APIs. In our experiments, we recruited 30 professional programmers to complete programming tasks using two APIs. The programmers interacted with a simulated virtual assistant for help - the programmers were not aware that the assistant was actually operated by human experts. We then annotated the dialogue acts in the corpus along four dimensions: illocutionary intent, API information type(s), backward-facing function, and traceability to specific API components. We observed a diverse range of interactions that will facilitate t