Language workbenches are tools that enable the definition, reuse, and composition of programming languages and their ecosystems, aiming to streamline language development. To facilitate their adoption by language designers, the comprehensibility of the language used to define other languages is an important aspect to evaluate. Moreover, considering that language workbenches are relatively new tools, user acceptance emerges as a crucial factor to be accounted for during their assessment. Current literature often neglects user-centred aspects like comprehensibility and acceptance in the assessment of this breed of tools. This paper addresses this gap through a family of experiments assessing Neverlang, a modular language workbench. The study adopts a tailored version of the Method Evaluation Model (MEM) to evaluate the comprehensibility of Neverlang's meta-language and programs, as well as user acceptance in terms of perceived ease of use, perceived usefulness, and intention to use. It also investigates the relationships among these dimensions. The experiments were conducted in three iterations involving participants from academia. The results reveal that users demonstrate sufficient
Modern engineering design platforms excel at discipline-specific tasks such as CAD, CAM, and CAE, but often lack native systems engineering frameworks. This creates a disconnect where system-level requirements and architectures are managed separately from detailed component design, hindering holistic development and increasing integration risks. To address this, we present the conceptual framework for the GenAI Workbench, a Model-Based Systems Engineering (MBSE) environment that integrates systems engineering principles into the designer's workflow. Built on an open-source PLM platform, it establishes a unified digital thread by linking semantic data from documents, physical B-rep geometry, and relational system graphs. The workbench facilitates an AI-assisted workflow where a designer can ingest source documents, from which the system automatically extracts requirements and uses vision-language models to generate an initial system architecture, such as a Design Structure Matrix (DSM). This paper presents the conceptual architecture, proposed methodology, and anticipated impact of this work-in-progress framework, which aims to foster a more integrated, data-driven, and informed eng
We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource at https://github.com/oll
Earth system science is producing increasingly large, high-dimensional datasets from physics based Earth system models to AI-based weather and climate models. Embedding-based representations can make these data searchable through similarity search and analog retrieval, but nearest neighbors in latent space are not automatically scientifically meaningful: it may reflect real weather structure, or preprocessing, geography, or model bias. Researchers therefore need ways to inspect how embeddings organize meteorological data, compare representation models, develop retrieval strategies, and verify results against physical evidence. We present an open-source visual analytics workbench for each of these steps. The system links embedding experiments to source data, metadata, spatial context, and model configurations, so latent-space results can be traced back to the physics. Users can explore latent spaces for different models, issue global or localized queries, and inspect analogs through familiar meteorological views. This enables a discovery workflow in which scientists characterize a phenomenon of interest in a well-understood dataset, identifying its signature in latent space, and the
Database workloads are increasingly nesting artificial intelligence (AI) and machine learning (ML) pipelines and AI/ML model inferences with data processing, yielding hybrid SQL+AI/ML queries that mix relational operators with expensive, opaque AI/ML operators, often expressed as UDFs. These workloads are challenging to optimize because ML operators behave like black boxes, data-dependent effects such as sparsity, selectivity, and cardinalities can dominate runtime, domain experts often rely on practical heuristics that are difficult to develop with monolithic optimizers, and AI/ML operators introduce numerous co-optimization opportunities such as factorization, pushdown, ML-to-SQL conversion, and linear-algebra-to-relational-algebra rewrites, significantly enlarging the search space of equivalent execution plans. At the same time, research prototypes for SQL+ML optimization are difficult to evaluate fairly because they are typically developed on different platforms and evaluated using different queries. We present OptBench, an interactive workbench for building and benchmarking query optimizers for hybrid SQL+AI/ML queries in a transparent, apples-to-apples manner. OptBench runs a
LLMbench is a browser-based workbench for the comparative close reading of large language model (LLM) outputs. Where existing tools for LLM comparison, such as Google PAIR's LLM Comparator are engineered for quantitative evaluation and user-rating metrics, LLMbench is oriented towards the hermeneutic practices of the digital humanities. Two model responses to the same prompt are side by side in annotatable panels with four analytical overlays (Probabilities for token-level log-probability inspection, Differences for word-level diff across the two panels, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting), alongside five analytical modes, Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence, that make the probabilistic structure of generated text legible at the token level. The tool treats the generated text as a research object in its own right from a probability distribution, a text that could have been otherwise, and provides visualisations including continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains, t
This resource paper addresses the challenge of evaluating Information Retrieval (IR) systems in the era of autoregressive Large Language Models (LLMs). Traditional methods relying on passage-level judgments are no longer effective due to the diversity of responses generated by LLM-based systems. We provide a workbench to explore several alternative evaluation approaches to judge the relevance of a system's response that incorporate LLMs: 1. Asking an LLM whether the response is relevant; 2. Asking the LLM which set of nuggets (i.e., relevant key facts) is covered in the response; 3. Asking the LLM to answer a set of exam questions with the response. This workbench aims to facilitate the development of new, reusable test collections. Researchers can manually refine sets of nuggets and exam questions, observing their impact on system evaluation and leaderboard rankings. Resource available at https://github.com/TREMA-UNH/autograding-workbench
With rise of blockchain popularity, more and more people seek to implement blockchain technology into their projects. Most common way is to take existing blockchain stack, such as Azure Blockchain Workbench or Oracle Blockchain Platform. While the blockchain technology is well-protected by its algorithms it is still vulnerable because its privacy relies on regular cryptography. And mistakes or vulnerabilities in key management protocols can affect even the most secure blockchain projects. This article considers question of vulnerabilities within Azure Blockchain Workbench key management system. We describe potential threats for each stage of key management lifecycle based on public reports and then assess how likely are those threats to realize within Azure Blockchain Workbench environment based on the technical documentation for Azure Blockchain Workbench and Azure Key Vault. Finally, we compile results of our assessment into the key management threat table with three distinct degrees of protection: fully protected, partially protected and not protected.
The global production of electric goods is at an all-time high, causing negative environmental and health impacts as well as a continuing depletion of natural resources. Considering the worsening global climate change, a transition of current industrial processes is necessary to tackle the above-mentioned factors. To address this urgent issue, socio-economic systems like the Circular Economy (CE) provide options to reallocate the use of resources and products on a global scale. Especially in terms of product lifecycle-prolonging, this system provides suitable approaches to alter the current modes of product handling by society and industry alike, based on the condition of the products. Although the importance and benefits of sustainable services enabling these options are widely known, users tend to shy away from using them. One of the reasons is the missing reliability in terms of the knowledge of the costs associated with a particular service. This uncertainty in expected pricing can, therefore, lower the willingness of potential clients. However, not only clients struggle with the boundary conditions of such services. On the part of the potential providers of services, the monet
Language Workbenches offer language designers an expressive environment in which to create their DSLs. Similarly, research into mechanised meta-theory has shown how dependently typed languages provide expressive environments to formalise and study DSLs and their meta-theoretical properties. But can we claim that dependently typed languages qualify as language workbenches? We argue yes! We have developed an exemplar DSL called Velo that showcases not only dependently typed techniques to realise and manipulate IRs, but that dependently typed languages make fine language workbenches. Velo is a simple verified language with well-typed holes and comes with a complete compiler pipeline: parser, elaborator, REPL, evaluator, and compiler passes. Specifically, we describe our design choices for well-typed IRs design that includes support for well-typed holes, how CSE is achieved in a well-typed setting, and how the mechanised type-soundness proof for Velo is the source of the evaluator.
NLP Workbench is a web-based platform for text mining that allows non-expert users to obtain semantic understanding of large-scale corpora using state-of-the-art text mining models. The platform is built upon latest pre-trained models and open source systems from academia that provide semantic analysis functionalities, including but not limited to entity linking, sentiment analysis, semantic parsing, and relation extraction. Its extensible design enables researchers and developers to smoothly replace an existing model or integrate a new one. To improve efficiency, we employ a microservice architecture that facilitates allocation of acceleration hardware and parallelization of computation. This paper presents the architecture of NLP Workbench and discusses the challenges we faced in designing it. We also discuss diverse use cases of NLP Workbench and the benefits of using it over other approaches. The platform is under active development, with its source code released under the MIT license. A website and a short video demonstrating our platform are also available.
The design of nuclear imaging scanners is crucial for optimizing detection and imaging processes. While advancements have been made in simplistic, symmetrical modalities, current research is progressing towards more intricate structures, however, the widespread adoption of computer-aided design (CAD) tools for modeling and simulation is still limited. This paper introduces FreeCAD and the GDML Workbench as essential tools for designing and testing complex geometries in nuclear imaging modalities. FreeCAD is a parametric 3D CAD modeler, and GDML is an XML-based language for describing complex geometries in simulations. Their integration streamlines the design and simulation of nuclear medicine scanners, including PET and SPECT scanners. The paper demonstrates their application in creating calibration phantoms and conducting simulations with Geant4, showcasing their precision and versatility in generating sophisticated components for nuclear imaging. The integration of these tools is expected to streamline design processes, enhance efficiency, and facilitate widespread application in the nuclear imaging field.
Metamodel-based DSL development in language workbenches like Xtext allows language engineers to focus more on metamodels and domain concepts rather than grammar details. However, the grammar generated from metamodels often requires manual modification, which can be tedious and time-consuming. Especially when it comes to rapid prototyping and language evolution, the grammar will be generated repeatedly, this means that language engineers need to repeat such manual modification back and forth. Previous work introduced GrammarOptimizer, which automatically improves the generated grammar using optimization rules. However, the optimization rules need to be configured manually, which lacks user-friendliness and convenience. In this paper, we present our vision for and current progress towards a language workbench that integrates GrammarOptimizer's grammar optimization rules to support rapid prototyping and evolution of metamodel-based languages. It provides a visual configuration of optimization rules and a real-time preview of the effects of grammar optimization to address the limitations of GrammarOptimizer. Furthermore, it supports the inference of a grammar based on examples from mod
In recent years, researchers have proposed a number of automated tools to identify and improve floating-point rounding error in mathematical expressions. However, users struggle to effectively apply these tools. In this paper, we work with novices, experts, and tool developers to investigate user needs during the expression rewriting process. We find that users follow an iterative design process. They want to compare expressions on multiple input ranges, integrate and guide various rewriting tools and understand where errors come from. We organize this investigation's results into a three-stage workflow and implement that workflow in a new, extensible workbench dubbed Odyssey. Odyssey enables users to: (1) diagnose problems in an expression, (2) generate solutions automatically or by hand, and (3) tune their results. Odyssey tracks a working set of expressions and turns a state-of-the-art automated tool "inside out," giving the user access to internal heuristics, algorithms, and functionality. In a user study, Odyssey enabled five expert numerical analysts to solve challenging rewriting problems where state-of-the-art automated tools fail. In particular, the experts unanimously pra
Electric motors can be damaged or operate improperly from a possible set of failures. Such failures are related to high or very low voltage and current levels, phase loss or blocked rotor. Therefore, it is important to protect these equipments through appropriate mechanisms. Alternatively, a workbench can simulate detectable failures related to the engines, allowing to change parameters, in which maintenance operators are able to identify the results of these changes. This work presents the development of a workbench as a tool for testing electrical machines and drives. The workbench is based on the Arduino programming platform (microcontroller system), in which it checks the functioning of electric motors under the condition of failures that may occur in this engine. Motor protections are carried out through an Intelligent Electronic Device (IED), which are popularly known as intelligent relays. The results show the development of a workbench that can test and identify several faults in a small three-phase motor.
Experimentation with software prototypes plays a fundamental role in software engineering research. In contrast to many other scientific disciplines, however, explicit support for this key activity in software engineering is relatively small. While some approaches to improve this situation have been proposed by the software engineering community, experiments are still very difficult and sometimes impossible to replicate. In this paper, we propose the concept of an experimentation workbench as a means of explicit support for experimentation in software engineering research. In particular, we discuss core requirements that an experimentation workbench should satisfy in order to qualify as such and to offer a real benefit for researchers. Beyond their core benefits for experimentation, we stipulate that experimentation workbenches will also have benefits in regard to reproducibility and repeatability of software engineering research. Further, we illustrate this concept with a scenario and a case study, and describe relevant challenges as well as our experience with experimentation workbenches.
Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executabil
The development of programming languages involves complex theoretical and practical challenges, particularly when addressing modularity and reusability through language extensions. While language workbenches aim to enable modular development under the constraints of the language extension problem, one critical constraint -- separate compilation -- is often relaxed due to its complexity. However, this relaxation undermines artifact reusability and integration with common dependency systems. A key difficulty under separate compilation arises from managing attribute grammars, as extensions may introduce new attributes that invalidate previously generated abstract syntax tree structures. Existing approaches, such as the use of dynamic maps in the Neverlang workbench, favor flexibility at the cost of compile-time correctness, leading to potential runtime errors due to undefined attributes. This work addresses this issue by introducing nlgcheck, a theoretically sound static analysis tool based on data-flow analysis for the Neverlang language workbench. nlgcheck detects potential runtime errors -- such as undefined attribute accesses -- at compile time, preserving separate compilation whi
In this paper, the steeped-transmission shaft design problem is proposed for weight optimization. The bio-inspired search-based Snail Homing and Mating Search (SHMS) algorithm is utilized to solve the problem. It is inspired by the social behaviour of snails and their inherent nature of finding better homes, and mate. The proposed steeped-transmission shaft design problem is modelled considering the fatigue loading, combined bending, torsion loads, and the principle of Modified Goodman criteria. The forces diagram and the bending moment diagrams are obtained using the MDSOLIDS software. The forces and bending moment are then used to mathematical model the objective function and constraints. The SHMS algorithm has yielded the desired solution with reasonable computational cost. The constraints are handled using a static penalty function approach. The statistical results obtained using SHMS algorithm are further used for generating CAD model. The analysis is carried out in ANSYS Workbench. Further, the deflection obtained from SHMS algorithm and ANSYS Workbench are compared and results are discussed in details.
Applications in labor market intelligence demand specialized NLP systems for a wide range of tasks, characterized by extreme multi-label target spaces, strict latency constraints, and multiple text modalities such as skills and job titles. These constraints have led to isolated, task-specific developments in the field, with models and benchmarks focused on single prediction tasks. Exploiting the shared structure of work-related data, we propose a unifying framework, combining a wide range of tasks in a multi-task ranking benchmark, and a flexible architecture tackling text-driven work tasks with a single model. The benchmark, WorkBench, is the first unified evaluation suite spanning six work-related tasks formulated explicitly as ranking problems, curated from real-world ontologies and human-annotated resources. WorkBench enables cross-task analysis, where we find significant positive cross-task transfer. This insight leads to Unified Work Embeddings (UWE), a task-agnostic bi-encoder that exploits our training-data structure with a many-to-many InfoNCE objective, and leverages token-level embeddings with task-agnostic soft late interaction. UWE demonstrates zero-shot ranking perfor