Search - ResearchTracker

Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consist

Visual-ERM: Reward Modeling for Visual Equivalence

arXiv, 2026-03-13. Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B d

Search results: Visual

Beyond Embeddings: The Promise of Visual Table in Visual Reasoning

Visual-ERM: Reward Modeling for Visual Equivalence

Tell Me Without Telling Me: Two-Way Prediction of Visualization Literacy and Visual Attention

3D Visual Illusion Depth Estimation

Does empirical evidence from healthy aging studies predict a practical difference between visualizations for different age groups?

StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

Visual Boosting Techniques for Spatiotemporal Dense Pixel Visualizations

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Visual-RFT: Visual Reinforcement Fine-Tuning

Bridging Service Design, Visualizations, and Visual Analytics in Healthcare Digital Twins: Challenges, Gaps, and Research Opportunities

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Explain Before You Answer: A Survey on Compositional Visual Reasoning

Visual Agentic Reinforcement Fine-Tuning

Visual Diffusion Models are Geometric Solvers

Towards Visual Grounding: A Survey

Learning Self-Supervised Audio-Visual Representations for Sound Recommendations

Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge

Object-level Visual Prompts for Compositional Image Generation