Search - ResearchTracker

While skill optimization for autonomous agents has gained traction, existing methods rely on complex pipelines. This leaves a fundamental question unaddressed: what constitutes a minimal viable pipeline for skill optimization, where every component is justified by theory or empirical necessity? We formalize skill optimization as Zeroth-Order (ZO) optimization, mapping classical counterparts (central difference, trust regions) to recent literature. We note that, unlike the blind numerical perturbations of classical ZO, skill trajectories serve as interpretable debugging feedback. Grounded in the Claude Code philosophy and PAC learning, we establish three principles for convergence and generalization: file-system-based trajectory exploration, consensus attribute mining, and independent validation gating. Eliminating redundancies, we propose SkillOpt-Lite. It accelerates convergence and outperforms full SkillOpt, improving LiveMath by +8.8 points on GPT-5.5 and +25.4 points on GPT-5.4-nano, which allows the nano model to surpass standard GPT-5.4 optimized by SkillOpt. Finally, we integrate our framework into production coding agents like VSCode Copilot, enabling developers to evolve agent skills v
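For readers unfamiliar with the classical counterparts the abstract names, the following is a minimal sketch of numerical zeroth-order optimization: a central-difference gradient estimate along random directions, followed by a step clipped to a trust region. It illustrates the classical technique only, not the SkillOpt or SkillOpt-Lite pipeline. All function names (`zo_gradient`, `trust_region_step`, `zo_minimize`) and parameter values are hypothetical choices made for this illustration.

```python
import random


def zo_gradient(f, x, mu=1e-3, num_dirs=20, rng=None):
    """Estimate the gradient of f at x from function values only.

    Averages central differences along random Gaussian directions u:
        g ~= mean( (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u )
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_dirs
    return grad


def trust_region_step(f, x, grad, radius, lr=0.1, max_radius=10.0):
    """Take a descent step no longer than `radius`; keep it only if f improves."""
    step = [-lr * g for g in grad]
    norm = sum(s * s for s in step) ** 0.5
    if norm > radius:
        # Clip the step onto the trust-region boundary.
        step = [s * radius / norm for s in step]
    candidate = [xi + si for xi, si in zip(x, step)]
    if f(candidate) < f(x):
        # Accepted: move and cautiously enlarge the region.
        return candidate, min(radius * 2.0, max_radius)
    # Rejected: stay put and shrink the region.
    return x, radius * 0.5


def zo_minimize(f, x0, iters=200, radius=1.0, seed=0):
    """Alternate gradient estimation and trust-region steps for `iters` rounds."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        grad = zo_gradient(f, x, rng=rng)
        x, radius = trust_region_step(f, x, grad, radius)
    return x
```

Calling `zo_minimize(lambda x: sum((xi - 1.0) ** 2 for xi in x), [0.0, 0.0, 0.0])` on this toy quadratic should move the iterate toward the minimizer at all ones. The accept-only-if-better rule plays a role loosely analogous to the validation gating the abstract mentions, though that mapping is an interpretation made here, not a claim from the paper.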

Verifiable Benchmarking of Long-Horizon Spatial Biology

arXiv, 2026-05-27. Authors: Ian Diks, Harihara Muralidharan, Tim Proctor

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBe

Search results: GPT-55

SkillOpt-Lite: Better and Faster Agent Self-evolution via One Line of Vibe

Verifiable Benchmarking of Long-Horizon Spatial Biology

R+R: Reassessing Java Security API Misuse in Current LLMs: A Replication on JCA and JSSE APIs with External Security Knowledge

Do Large Language Models know Which Published Articles have been Retracted?

Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge

Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Retrospective Forecasting Case Study

Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Towards Automated Detection of Inline Code Comment Smells

Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

Persona-Model Collapse in Emergent Misalignment

Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting

Epistemic Regret Minimization: Label-Free Causal Critique Beyond Outcome Reward