Search - ResearchTracker

Desktop interaction streams provide a continuous, privacy-sensitive record of interleaved user tasks. Transforming these streams into task-organized personal logs on-device faces two main challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. We present FOCAL (Filtered On-device Continuous Activity Logging), a privacy-first multi-agent system utilizing a unified filter-plan-log architecture. It cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for context-coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under $A{\to}B{\to}A$ task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on-device summarization of instruction-free desktop stream

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

arXiv, 2025-10-07. Authors: Suhwan Choi, Jaeyoon Jung, Haebin Seong

Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and na

Search results: desktop

FOCAL: Filtered On-device Continuous Activity Logging for Efficient Personal Desktop Summarization

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Proteus: Shapeshifting Desktop Visualizations for Mobile via Multi-level Intelligent Adaptation

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Traversing Dual Realities: Investigating Techniques for Transitioning 3D Objects between Desktop and Augmented Reality Environments

WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

Undefined Behavior in C and C++: An Experiment With Desktop Use Cases

ChainWorld: Composing Long-Horizon Desktop Workloads from Atomic OSWorld Tasks

UFO2: The Desktop AgentOS

Use Cases for High Performance Research Desktops

DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

DeGrip: A Compact Cable-driven Robotic Gripper for Desktop Disassembly

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

A multimodal gesture recognition dataset for desktop human-computer interaction

GUIrilla: A Scalable Framework for Automated Desktop UI Exploration

A Real-time 3D Desktop Display

"Pass the butter": A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT

Design and development of desktop braille printing machine at Fablab Nepal

Evaluating Navigation and Comparison Performance of Computational Notebooks on Desktop and in Virtual Reality

Sharingan: Extract User Action Sequence from Desktop Recordings