Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling -- whether the correct tool is called with the correct parameters -- and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can't be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at https://github.com/NVIDIA/When2Call.
AI systems can generate outputs at scale, but most outputs require human approval before release. This creates a bottleneck: humans cannot keep pace with AI-generated volume. A natural response is to insert an LLM-judge that screens outputs before they reach humans, filtering errors and amplifying effective review capacity. But judges are imperfect. False rejections send correct outputs back for unnecessary rework; false acceptances consume judge capacity without relieving humans. When should outputs be routed through the judge, and when should they bypass it directly to human review? We model this workflow as a queueing network with three resource pools and use a fluid approximation to characterize optimal judge allocation. The analysis reveals that optimal allocation depends critically on which resource is the current bottleneck: screening amplifies human capacity when reviewers are scarce, yet generates a rework trap that crowds out new production when workers are stretched thin. For heterogeneous task classes with different error profiles, optimal priority can reverse across operating regimes, and classes with complementary error structures can be mixed to achieve throughput th
We document a fundamental paradox in AI transparency: explanations improve decisions when algorithms are correct but systematically worsen them when algorithms err. In an experiment with 257 medical students making 3,855 diagnostic decisions, we find explanations increase accuracy by 6.3 percentage points when the AI is correct (73% of cases) but decrease it by 4.9 points when it is incorrect (27% of cases). This asymmetry arises because modern AI systems generate equally persuasive explanations regardless of recommendation quality: physicians cannot distinguish helpful from misleading guidance. We show physicians treat explained AI as 15.2 percentage points more accurate than it really is, with over-reliance persisting even for erroneous recommendations. Competent physicians with appropriate uncertainty suffer most from the AI transparency paradox (-12.4pp when the AI errs), while overconfident novices benefit most (+9.9pp net). Welfare analysis reveals that selective transparency generates \$2.59 billion in annual healthcare value, 43% more than the \$1.82 billion from mandated universal transparency.
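Taken together, the reported figures imply a positive average effect that conceals the asymmetry. A quick back-of-the-envelope check (assuming the case mix is independent of the explanation effect):

```python
# Expected net effect of showing AI explanations, using the figures
# reported in the abstract (illustrative averaging only).
p_correct, p_incorrect = 0.73, 0.27   # fraction of cases where the AI is right/wrong
gain_pp, loss_pp = 6.3, -4.9          # percentage-point accuracy change in each case

net_pp = p_correct * gain_pp + p_incorrect * loss_pp
print(f"net accuracy change: {net_pp:+.2f} pp")  # positive on average,
# which is exactly why blanket transparency can still hide the harm on errors
```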
Continual learning systems must strike a balance between plasticity, the ability to acquire new knowledge, and stability, the preservation of previously learned representations. This stability-plasticity dilemma affects how representations can be reused across tasks: shared structure enables transfer when tasks are similar but may also induce interference when new learning disrupts existing representations. However, it remains unclear when and why structural separation influences this trade-off. In this study, we examine how network architecture, task similarity, and representational dimensionality jointly shape learning in a sequential task paradigm inspired by transfer-interference studies. We compare a task-partitioned modular recurrent network with a single-module baseline by systematically varying task similarity (low, medium, high) and the scale of weight initialization, which induces different learning regimes that we empirically characterize through the effective dimensionality of the learned representations. We find that architecture has minimal impact in high-dimensional regimes where representations are sufficiently unconstrained to accommodate multiple tasks without strong inte
We study a long-run persuasion problem where a long-lived Sender repeatedly interacts with a sequence of short-lived Receivers who may adopt a misspecified model for belief updating. The Sender commits to a stationary information structure, but suspicious Receivers compare it to an uninformative alternative and may switch based on the Bayes factor rule. We characterize when the one-shot Bayesian Persuasion-optimal (BP-optimal) structure remains optimal in the long run despite this switching risk. In particular, when Receivers cannot infer the state from the Sender's preferred action, they never switch, and the BP-optimal structure maximizes the Sender's lifetime utility. In contrast, when such inference is possible, full disclosure may outperform BP-optimal. Our findings highlight the strategic challenges of information design when the Receivers' interpretation of signals evolves over time.
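The Bayes-factor comparison that drives switching can be sketched as follows; the binary signal distributions, the observation history, and the threshold logic are hypothetical placeholders, not the paper's model:

```python
import math

def log_bayes_factor(signals, p_informative, p_uninformative):
    """Log Bayes factor comparing the Sender's committed (informative)
    signal distribution against an uninformative alternative.
    `signals` is a list of observed signal realizations; each
    distribution maps a signal to its probability."""
    lbf = 0.0
    for s in signals:
        lbf += math.log(p_informative[s]) - math.log(p_uninformative[s])
    return lbf

# Hypothetical binary-signal example: a suspicious Receiver abandons the
# Sender's structure once the log Bayes factor falls below a threshold.
p_inf = {"high": 0.8, "low": 0.2}
p_unif = {"high": 0.5, "low": 0.5}
history = ["high", "low", "low", "low"]
lbf = log_bayes_factor(history, p_inf, p_unif)
threshold = 0.0
switched = lbf < threshold  # evidence currently favors the uninformative model
```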
When a model knows when it does not know, many possibilities emerge. The first question is how to enable a model to recognize that it does not know. A promising approach is to use confidence, computed from the model's internal signals, to reflect its ignorance. Prior work in specific domains has shown that calibration can provide reliable confidence estimates. In this work, we propose a simple, effective, and universal training-free method that applies to both vision and language models, performing model calibration, cascading, and data cleaning to better exploit a model's ability to recognize when it does not know. We first highlight two key empirical observations: higher confidence corresponds to higher accuracy within a single model, and models calibrated on the validation set remain calibrated on a held-out test set. These findings empirically establish the reliability and comparability of calibrated confidence. Building on this, we introduce two applications: (1) model cascading with calibrated advantage routing and (2) data cleaning based on model ensemble. Using the routing signal derived from the comparability of calibrated confidences, we cascade large and small models to
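The calibrate-then-cascade recipe can be sketched as follows; temperature scaling and max-probability confidence are standard training-free choices, but the function names and threshold here are illustrative, not the paper's exact routing rule:

```python
import numpy as np

def temperature_scale(logits, T):
    """Calibrate logits with a single temperature T (fit on a
    validation set; a standard training-free calibration)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def cascade(small_logits, large_logits, T_small, T_large, threshold):
    """Route to the large model only when the small model's calibrated
    confidence falls below `threshold` (illustrative routing signal)."""
    p_small = temperature_scale(small_logits, T_small)
    conf = p_small.max(axis=-1)
    use_large = conf < threshold
    p_large = temperature_scale(large_logits, T_large)
    preds = np.where(use_large, p_large.argmax(-1), p_small.argmax(-1))
    return preds, use_large
```

In a deployed cascade, the large model would only be invoked on the routed subset; here both are evaluated for simplicity.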
Appropriate decisions depend on information gathered beforehand, yet such information is often obtained through intermediaries with biased preferences. Motivated by settings such as testing and recertification in organ transplantation, we study the problem faced by a decision-maker who can only access costly information through an agent with misaligned preferences. In a dynamic framework with exogenous decision timing, we ask how requests for verifiable information (evidence) should be scheduled and their implications for the quality of attained choices. When the agent's incentives are ignored, evidence requests do not condition on previously reported information. However, such policies may be susceptible to strategic manipulation by the agent. We show that, in these cases, optimal requests should be biased: additional evidence is more likely to be sought when previous reports favor the agent's preferred outcome.
This article is a continuation of [6], where a classification was given of when the space of minimal prime subgroups of a given lattice-ordered group, equipped with the inverse topology, has a clopen $\pi$-base. For nice $\ell$-groups (e.g., W-objects), this occurs precisely when the space of maximal $d$-subgroups (under the hull-kernel topology) has a clopen $\pi$-base. It occurred to us that there is presently no classification of when the space of maximal $d$-subgroups of a W-object is zero-dimensional, except in the case of $C(X)$, the ring of real-valued continuous functions on a topological space $X$, considered in [5].
When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93\% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.
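A minimal sketch of a single AHA-style head follows. Assumptions: the router decision is supplied externally per head, whereas the paper learns a binary router and toggles per token; names and the toy setup are illustrative:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Causal sliding-window mask: token i attends to positions [i-w+1, i]."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

def aha_head(q, k, v, use_full, window=256):
    """One attention head under All-or-Here Attention (AHA): a binary
    decision toggles between full causal attention and a local sliding
    window. Sketch only; the learned per-token router is omitted."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    mask = causal if use_full else sliding_window_mask(n, window)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```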
Causal forests estimate how treatment effects vary across individuals, guiding personalized interventions in areas like marketing, operations, and public policy. A standard modeling practice with this method is honest estimation: dividing the data into two samples, one to define subgroups and another to estimate treatment effects within them. This is intended to reduce overfitting and is the default in many software packages. But is it the right choice? In this paper, we show that honest estimation can reduce the accuracy of individual-level treatment effect estimates, especially when there are substantial differences in how individuals respond to treatment, and the data is rich enough to uncover those differences. The core issue is a classic bias-variance trade-off: honesty lowers the risk of overfitting but increases the risk of underfitting, because it limits the data available to detect and model heterogeneity. Across 7,500 benchmark datasets, we find that the cost of using honesty by default can be as high as requiring 25% more data to match the performance of models trained without it. We argue that honesty is best understood as a form of regularization and its use should be
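The honest split described above can be sketched for a single tree; `fit_tree` and the leaf-effect construction below are illustrative placeholders, not the causal-forest implementation:

```python
import numpy as np

def honest_tree_effects(X, w, y, fit_tree, honest=True, seed=0):
    """Honest estimation for one causal tree: one half of the data
    defines the leaves (subgroups), the other half estimates the
    treatment effect within each leaf. With honest=False, the same
    data does both jobs (the 'adaptive' alternative).
    Assumes each leaf contains treated and control units in the
    estimation sample."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    if honest:
        split_idx, est_idx = idx[: len(y) // 2], idx[len(y) // 2:]
    else:
        split_idx = est_idx = idx  # adaptive: reuse all data for both steps
    leaf_of = fit_tree(X[split_idx], w[split_idx], y[split_idx])
    leaves = leaf_of(X[est_idx])
    effects = {}
    for leaf in np.unique(leaves):
        m = leaves == leaf
        treated = y[est_idx][m & (w[est_idx] == 1)]
        control = y[est_idx][m & (w[est_idx] == 0)]
        effects[leaf] = treated.mean() - control.mean()
    return leaf_of, effects
```

The trade-off in the abstract is visible in the code: with `honest=True`, only half the sample is available at each step, lowering overfitting risk but also the resolution for detecting heterogeneity.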
Centrality metrics aim to identify the most relevant nodes in a network. The literature offers a broad set of metrics, measuring either local or global centrality characteristics. Nevertheless, when networks exhibit a high spectral gap, the usual global centrality measures typically add little information beyond the degree, i.e., the simplest local metric. To extract new information from this class of networks, we propose the use of the GENeralized Economic comPlexitY index (GENEPY). Despite its original definition within the economic field, the GENEPY can be readily applied and interpreted on a wide range of networks characterized by a high spectral gap, including monopartite and bipartite systems. Tests on synthetic and real-world networks show that the GENEPY can shed new light on node centrality, carrying information generally poorly correlated with a node's number of direct connections (its degree).
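As a small helper for identifying the regime discussed above, a normalized spectral gap of an adjacency matrix can be computed directly; this is a generic sketch, not part of the GENEPY definition:

```python
import numpy as np

def spectral_gap(A):
    """Normalized spectral gap of an adjacency matrix: the relative
    difference between the two largest eigenvalue magnitudes. Networks
    with a gap near 1 are the high-spectral-gap regime where global
    centrality measures tend to collapse onto the degree."""
    ev = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return (ev[0] - ev[1]) / ev[0]

# Complete graph K4: eigenvalues {3, -1, -1, -1}, so the gap is 2/3.
A = np.ones((4, 4)) - np.eye(4)
```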
For a one dimensional analytically unramified Cohen-Macaulay local ring $R$, the blowup algebra of the canonical ideal is a module finite birational extension. The conductor of this extension always contains the conductor of $R$. We study the case when there is equality. This is the case where $R$ is far from being almost Gorenstein. We study this property within the landscape of numerical semigroup rings and local Arf rings.
Robots often localize to reduce navigation errors and to facilitate downstream, high-level tasks. However, a robot may want to localize only selectively when localization is costly (as with resource-constrained robots) or inefficient (for example, submersibles that need to surface), especially when navigating environments with variable numbers of hazards such as obstacles and shipping lanes. In this study, we propose a method that helps a robot determine ``when to localize'' so as to 1) minimize such actions and 2) not exceed a given probability of failure (such as surfacing within high-traffic shipping lanes). We formulate our method as a Constrained Partially Observable Markov Decision Process and use the Cost-Constrained POMCP solver to plan the robot's actions. The solver simulates failure probabilities to decide whether the robot moves toward its goal or localizes to prevent failure. We performed numerical experiments with multiple baselines.
The ability to predict the attention of expert pathologists could lead to decision support systems for better pathology training. We developed methods to predict the spatio-temporal (where and when) movements of pathologists' attention as they grade whole slide images (WSIs) of prostate cancer. We characterize a pathologist's attention trajectory by their x, y, and m (magnification) movements of a viewport as they navigate WSIs using a digital microscope. This information was obtained from 43 pathologists across 123 WSIs, and we consider the task of predicting the pathologist attention scanpaths constructed from the viewport centers. We introduce a fixation extraction algorithm that simplifies an attention trajectory by extracting fixations in the pathologist's viewing while preserving semantic information, and we use these pre-processed data to train and test a two-stage model to predict the dynamic (scanpath) allocation of attention during WSI reading via intermediate attention heatmap prediction. In the first stage, a transformer-based sub-network predicts the attention heatmaps (static attention) across different magnifications. In the second stage, we predict the attention sca
Ensuring the reliability and safety of automated decision-making is crucial. It is well-known that data distribution shifts in machine learning can produce unreliable outcomes. This paper proposes a new approach for measuring the reliability of predictions under distribution shifts. We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids. We propose this distance as a metric to evaluate the confidence of predictions under distribution shifts. We assign each prediction to a cluster with centroid representing the mean softmax output for all correct predictions of a given class. We then define a safety threshold for a class as the smallest distance from an incorrect prediction to the given class centroid. We evaluate the approach on the MNIST and CIFAR-10 datasets using a Convolutional Neural Network and a Vision Transformer, respectively. The results show that our approach is consistent across these data sets and network models, and indicate that the proposed metric can offer an efficient way of determining when automated predictions are acceptable and when they should be deferred to human operators given
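The centroid-and-threshold construction described above reads directly as code; the following is a sketch of that procedure with hypothetical toy inputs:

```python
import numpy as np

def fit_safety_thresholds(softmax_out, preds, labels):
    """Per-class centroids of softmax outputs over correct predictions,
    and a per-class safety threshold: the smallest distance from any
    *incorrect* prediction of that class to the class centroid."""
    classes = np.unique(labels)
    centroids, thresholds = {}, {}
    for c in classes:
        correct = softmax_out[(preds == c) & (labels == c)]
        centroids[c] = correct.mean(axis=0)
        wrong = softmax_out[(preds == c) & (labels != c)]
        d = np.linalg.norm(wrong - centroids[c], axis=1)
        thresholds[c] = d.min() if len(d) else np.inf
    return centroids, thresholds

def accept(softmax_vec, centroids, thresholds):
    """Accept a prediction only if it lies closer to its class centroid
    than the nearest known incorrect prediction; otherwise defer to a
    human operator."""
    c = int(np.argmax(softmax_vec))
    return bool(np.linalg.norm(softmax_vec - centroids[c]) < thresholds[c])
```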
Many biological systems are governed by difference equations and exhibit discrete-time dynamics. Examples include the size of a population when generations are non-overlapping, and the incidence of a disease when infections are recorded at fixed intervals. For discrete-time systems lacking exact solutions, continuous-time approximations are frequently employed when small changes occur between discrete time steps. Here, we present an approach motivated by exactly soluble discrete-time problems. We show that such systems have continuous-time descriptions (governed by differential equations) whose solutions precisely agree, at the discrete times, with the discrete-time solutions, irrespective of the size of changes that occur. For discrete-time systems lacking exact solutions, we develop approximate continuous-time models that can, to high accuracy, capture rapid growth and decay. Our approach employs mappings between difference and differential equations, generating functional solutions that exactly or closely preserve the original discrete-time behaviour. It uncovers fundamental structural parallels and also distinctions between the difference equation and the `equivalent' differential equation.
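An exactly soluble instance of such a mapping, offered as a standard example rather than one of the paper's specific systems: geometric growth $x_{t+1} = r x_t$ and the ODE $dx/dt = (\ln r)\,x$ agree exactly at every integer time, however large $r$ is.

```python
import math

def discrete(x0, r, t):
    """Iterate the difference equation x_{t+1} = r * x_t for t steps."""
    x = x0
    for _ in range(t):
        x = r * x
    return x

def continuous(x0, r, t):
    """Solution of dx/dt = ln(r) * x, i.e. x0 * exp(t * ln r) = x0 * r**t."""
    return x0 * math.exp(t * math.log(r))

# The two descriptions coincide at the discrete times even though r = 3
# is far from the "small changes per step" regime.
for t in range(6):
    assert math.isclose(discrete(2.0, 3.0, t), continuous(2.0, 3.0, t))
```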
The belief that numbers offer a single, objective description of reality overlooks a crucial truth: data does not speak for itself. Every dataset results from choices (what to measure, how, when, and with whom) which inevitably reflect implicit, and sometimes ideological, assumptions about what is worth quantifying. Moreover, in any analysis, what remains unmeasured can be just as significant as what is captured. When a key variable is omitted, whether by neglect, design, or ignorance, it can distort the observed relationships between other variables. This phenomenon, known as omitted variable bias, may produce misleading correlations or conceal genuine effects. In some cases, accounting for this hidden factor can completely overturn the conclusions drawn from a superficial analysis. This is precisely the mechanism behind Simpson's paradox.
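The reversal is easy to exhibit numerically. The figures below are the classic kidney-stone success counts often used to illustrate Simpson's paradox, not data from this article:

```python
def rate(success, total):
    return success / total

# Treatment A vs B, stratified by the omitted variable (stone size).
A_small, B_small = (81, 87), (234, 270)
A_large, B_large = (192, 263), (55, 80)

# Within each stratum, A outperforms B...
assert rate(*A_small) > rate(*B_small)
assert rate(*A_large) > rate(*B_large)

# ...yet pooling over the omitted variable reverses the ranking.
A_all = (A_small[0] + A_large[0], A_small[1] + A_large[1])
B_all = (B_small[0] + B_large[0], B_small[1] + B_large[1])
assert rate(*A_all) < rate(*B_all)
```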
Language models (LMs) may appear insensitive to word order changes in natural language understanding (NLU) tasks. In this paper, we propose that linguistic redundancy can explain this phenomenon, whereby word order and other linguistic cues such as case markers provide overlapping and thus redundant information. Our hypothesis is that models exhibit insensitivity to word order when the order provides redundant information, and that the degree of insensitivity varies across tasks. We quantify how informative word order is using mutual information (MI) between unscrambled and scrambled sentences. Our results show that the less informative word order is, the more consistent the model's predictions are between unscrambled and scrambled sentences. We also find that the effect varies across tasks: for some tasks, like SST-2, LMs' predictions are almost always consistent with the original ones even when the Pointwise-MI (PMI) changes, while for others, like RTE, the consistency is near random when the PMI is lower, i.e., when word order is highly informative.
We argue that on-shell excitations with large negative energies are created rapidly when the string coupling increases with time. This does not indicate an inconsistency in string theory since the negative energy on-shell excitation is always entangled with an on-shell excitation with a positive energy. The total energy of this energy-EPR state vanishes. We discuss the reason the energy-EPR states appear in string theory and the role they might play in black hole physics.
We characterize, in terms of the defining graph, when a twisted right-angled Artin group (a group whose only relations among pairs of generators are either commuting or Klein-bottle type relations) is left-orderable.