Search results: Terminal

20 results found


On Data Engineering for Scaling LLM Terminal Capabilities

arXiv

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collect

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

arXiv, 2026-05-21. Authors: Zhaoyang Chu, Jiarui Hu, Xingyu Jiang

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.

Search results: Terminal

On Data Engineering for Scaling LLM Terminal Capabilities

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

Terminal Lucidity: Envisioning the Future of the Terminal

A framework for joint assessment of a terminal event and a score existing only in the absence of the terminal event

Tmax: A simple recipe for terminal agents

Terminal Steiner tree problem: Complexity and Algorithms

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Endless Terminals: Scaling RL Environments for Terminal Agents

Time discretization of BSDEs with singular terminal condition using asymptotic expansion

Matrix Riccati BSDEs with singular terminal condition and stochastic LQ control with linear terminal constraint

Quantitative Soft-to-Hard Terminal Constraint Convergence for the Heat Equation

Learning-based Autonomous Channel Access in the Presence of Hidden Terminals

On Continuous Terminal Embeddings of Sets of Positive Reach

A New Secret key Agreement Scheme in a Four-Terminal Network

Distributed Model Predictive Control for Linear Systems with Adaptive Terminal Sets

Terminal spaces of monoids

The Parameterized Complexity of Terminal Monitoring Set

Terminal Coalgebras in Countably Many Steps