Search - ResearchTracker

Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance compar

Negative Data Mining for Contrastive Learning in Dense Retrieval at IKEA.com

arXiv, 2026-05-01. Authors: Eva Agapaki, Amritpal Singh Gill

Contrastive learning is a core component of modern retrieval systems, but its effectiveness heavily relies on the quality of negative examples used during training. In this work, we present a systematic approach to improving dense retrieval for IKEA product search through structured negative sampling strategies and scalable LLM-as-a-judge relevance evaluation. Building on IKEA Search Engine's late-interaction retrieval architectures, we introduce two key contributions: (1) structured negative sampling strategies that leverage product hierarchical taxonomy and product attributes to generate semantically challenging negatives, and (2) a comprehensive LLM-based evaluation methodology for generating training data. Rather than relying on sparse human annotations or random sampling, our LLM-based evaluation system allocates a score for all candidate products against each query. Our methodology achieves +2.6% average category accuracy on offline real user query experiments on the Canada market. However, our A/B test on long-tail queries showed no statistically significant differences in user engagement metrics between the improved and baseline models (p > 0.05). We trace this gap to

Search results: Ikeas

Understanding Multimodal Complementarity for Single-Frame Action Anticipation

Negative Data Mining for Contrastive Learning in Dense Retrieval at IKEA.com

Influence of Interactivity in Shaping User Experience and Social Acceptance of Mobile XR

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation

Contrastive Learning for Diversity-Aware Product Recommendations in Retail

Accelerating Physical Property Reasoning for Augmented Visual Cognition

Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

Probabilistic Temporal Masked Attention for Cross-view Online Action Detection

Multi Activity Sequence Alignment via Implicit Clustering

Underactuated dexterous robotic grasping with reconfigurable passive joints

Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models

Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries

Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

The Importance of Cognitive Biases in the Recommendation Ecosystem

Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition

Hierarchical Vector Quantization for Unsupervised Action Segmentation

ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network