Since developers invoke the build system frequently, its performance can impact productivity. Modern artifact-based build tools accelerate builds, yet prior work shows that teams may abandon them for alternatives that are easier to maintain. While prior work shows why downgrades are performed, the implications of downgrades remain largely unexplored. In this paper, we describe a case study of the Kubernetes project, focusing on its downgrade from an artifact-based build tool (Bazel) to a language-specific solution (Go Build). We reproduce and analyze the full and incremental builds of change sets during the downgrade period. On the one hand, we find that Bazel builds are faster than Go Build, completing full builds in 23.06-38.66 up to 75.19 impose a larger memory footprint than Go Build of 81.42-351.07 respectively. Bazel builds also impose a greater CPU load at parallelism settings above eight for full builds and above one for incremental builds. We estimate that downgrading from Bazel can increase CI resource costs by up to 76 explore whether our observations generalize by replicating our Kubernetes study on four other projects that also downgraded from Bazel to older build tool
Build systems have become an indispensable part of the software implementation and deployment process. New programming languages, for example Go, Rust, and Zig, are released with a build system integrated into the language tools. However, in the hardware description domain, no official build systems have been released with the predominant Hardware Description Languages (HDLs) such as VHDL or SystemVerilog. Moreover, hardware design projects are often multilingual. The paper characterizes and compares two common approaches for hardware build system implementations. The first is the direct-Tcl approach, in which the build system code is executed directly by the EDA tool during the design build flow. The second is the indirect-abstract approach, in which the build system produces a Tcl script that is later run by the appropriate EDA tool. As none of the existing direct-Tcl build systems was close to the indirect-abstract build systems in terms of supported functionality, the paper also presents a new direct-Tcl hardware build system called HBS. HBS serves as the representative of direct-Tcl build systems in the comparison with indirect-abstract build systems.
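As a rough illustration of the indirect-abstract approach described in this abstract, the sketch below turns a tool-agnostic design description into a Tcl script that an EDA tool would run in a later step. It is not HBS or any existing build system: the `Core` class, the `render_tcl` function, and the Tcl command names (`add_source`, `set_top`, `run_synthesis`) are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """A hypothetical, tool-agnostic description of one design unit."""
    top: str
    files: list[str] = field(default_factory=list)


def render_tcl(core: Core) -> str:
    """Render the abstract description into a Tcl script for a later EDA run.

    The Tcl command names used here (add_source, set_top, run_synthesis)
    are placeholders, not the commands of any real EDA tool.
    """
    lines = ["# generated script; do not edit by hand"]
    for path in core.files:
        # Pick a language tag from the file extension (multilingual designs).
        lang = "vhdl" if path.endswith((".vhd", ".vhdl")) else "systemverilog"
        lines.append(f"add_source -lang {lang} {{{path}}}")
    lines.append(f"set_top {core.top}")
    lines.append("run_synthesis")
    return "\n".join(lines) + "\n"


script = render_tcl(Core(top="uart", files=["rtl/uart.vhd", "rtl/fifo.sv"]))
print(script)
```

In a direct-Tcl system, by contrast, the equivalent logic would itself be written in Tcl and evaluated inside the EDA tool, with no intermediate generation step.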
Continuous Integration (CI) systems often run many builds concurrently. In this setting, a legitimate build failure may not be caused by the code push that triggered it. Such unrelated build failures can waste developer effort because developers must determine whether the failure is actionable for their current change. We study 77,354 CI build failures from seven open source Apache projects to understand and predict unrelated build failures. We find that developers spend a median of 4 hours identifying whether a failure is related or unrelated to their push. We also perform a document analysis of 371 confirmed unrelated build failures sampled from 10,316 potentially unrelated failures. The analysis shows that unrelated test failures account for 20% of the cases in which developers classify build failures as unrelated. To predict unrelated build failures, we extract 33 features from issue reports, issue comments, and commits associated with the triggering push. We build semi-supervised Positive and Unlabeled (PU) learning models for seven Apache projects. The models achieve precision from 0.70 to 0.88, recall from 0.30 to 1.00, F1-score from 0.44 to 0.91, and AUC from 0.63 to 0.97.
A long continuous integration (CI) build forces developers to wait for CI feedback before starting subsequent development activities, which wastes time. In addition to the variety of build scheduling and test selection heuristics studied in the past, new artifact-based build technologies like Bazel have built-in support for advanced performance optimizations such as parallel builds and incremental builds (caching of build results). However, little is known about the extent to which new build technologies like Bazel deliver on their promised benefits, especially for projects with long build durations. In this study, we collected 383 Bazel projects from GitHub, studied their use of Bazel's parallel and incremental builds in 4 popular CI services, and compared the results with Maven projects. We conducted 3,500 experiments on 383 Bazel projects and analyzed the build logs of a subset of 70 buildable projects to evaluate the performance impact of Bazel's parallel builds. Additionally, we performed 102,232 experiments on the 70 buildable projects' last 100 commits to evaluate Bazel's incremental build performance. Our results show that 31.23% of Bazel projects adopt a CI service but do n
Kettle is an attested build system that produces cryptographically verifiable provenance for software built inside Trusted Execution Environments (TEEs). A Kettle build records the source commit, dependency set, toolchain, build environment, and output artifact digests in a provenance document produced inside a measured confidential VM. The SHA-256 digest of that document is committed to the TEE platform's attestation report-data field, so the hardware-signed attestation report is itself the signature on the provenance, with the signing identity chaining to the TEE manufacturer's root of trust rather than to the build infrastructure operator. Because the CVM image is itself reproducible, its launch measurement is public and stable, which lets a build requester pre-attest the CVM before submitting any input and optionally deliver source over a TLS channel terminated inside it, so the build runs end-to-end confidentially without the host ever seeing source code in plaintext. Verification reduces to one signature check against the vendor root and a small set of digest comparisons, with no need to re-execute the build. The result removes the build infrastructure, its operators, and the
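The digest-comparison part of the verification described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not Kettle's implementation: the canonical JSON encoding, the provenance field names, and the convention that the digest occupies the first 32 bytes of the report-data field are all assumptions, and the signature check against the vendor root is omitted entirely.

```python
import hashlib
import hmac
import json


def provenance_digest(provenance: dict) -> bytes:
    """SHA-256 over a canonical JSON encoding (sorted keys, compact separators).

    The canonicalization is an assumption; the source does not specify one.
    """
    encoded = json.dumps(provenance, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).digest()


def digest_matches_report(provenance: dict, report_data: bytes) -> bool:
    """Check that the report-data field starts with the provenance digest.

    The report-data field may be wider than 32 bytes, so only the leading
    32 bytes are compared (assumed layout). Constant-time comparison is used.
    """
    return hmac.compare_digest(provenance_digest(provenance), report_data[:32])


# Hypothetical provenance document and a zero-padded 64-byte report-data field.
provenance = {"source_commit": "abc123", "artifact_sha256": "deadbeef", "toolchain": "example-1.0"}
report_data = provenance_digest(provenance) + bytes(32)

print(digest_matches_report(provenance, report_data))  # True
tampered = dict(provenance, source_commit="evil")
print(digest_matches_report(tampered, report_data))  # False
```

A full verifier would first validate the attestation report's signature chain and the launch measurement before trusting the report-data field at all.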
Continuous Integration (CI) consists of an automated build process involving continuous compilation, testing, and packaging of the software system. While CI comes with several advantages related to quality and time to delivery, it also presents several challenges addressed by a large body of research. To better understand the literature so as to help practitioners find solutions for their problems and guide future research, we conduct a systematic review of 97 studies on build optimization published between 2006 and 2024, which we summarize according to their goals, methodologies, datasets, and metrics. The identified build optimization studies focus on two main challenges: (1) long build durations, and (2) build failures. To meet the first challenge, existing studies have developed a range of techniques, including predicting build outcome and duration, selective build execution, and build acceleration using caching or repairing performance smells. The causes of build failures have been the subject of several studies, leading to the development of techniques for predicting build script maintenance and automating repair. Recent studies have also focused on predict
The software build process transforms source code into deployable artifacts, representing a critical yet vulnerable stage in software development. Build infrastructure security poses unique challenges: the complexity of multi-component systems (source code, dependencies, build tools), the difficulty of detecting intrusions during compilation, and prevalent build non-determinism that masks malicious modifications. Despite these risks, the security community lacks a systematic understanding of build-specific attack vectors, hindering effective defense design. This paper presents an empirically derived taxonomy of attack vectors targeting the build process, constructed through large-scale CVE mining of 621 vulnerability disclosures from the NVD. We categorize attack vectors by their injection points across the build pipeline, from source code manipulation to compiler compromise. To validate our taxonomy, we analyzed 168 documented software supply chain attacks, identifying 40 incidents specifically targeting build phases. Our analysis reveals that 23.8% of supply chain attacks exploit build vulnerabilities, with dependency confusion and build script injection representin
Incremental and parallel builds are crucial features of modern build systems. Parallelism enables fast builds by running independent tasks simultaneously, while incrementality saves time and computing resources by processing only the build operations that were affected by a particular code change. Writing build definitions that lead to error-free incremental and parallel builds is a challenging task. This is mainly because developers are often unable to predict the effects of build operations on the file system and how different build operations interact with each other. Faulty build scripts may seriously degrade the reliability of automated builds, as they cause build failures as well as non-deterministic and incorrect build results. To reason about arbitrary build executions, we present buildfs, a generally applicable model that takes into account the specification (as declared in build scripts) and the actual behavior (low-level file system operations) of build operations. We then formally define different types of faults related to incremental and parallel builds in terms of the conditions under which a file system operation violates the specification of a build operation. Our testing appr
When Computer Science (CS) students try to use or extend open-source software (OSS) projects, they often encounter the common challenge of OSS failing to build on their local machines. Even though OSS often provides ready-to-build packages, subtle differences in local environment setups can lead to build issues, costing students tremendous time and effort in debugging. Despite the prevalence of build issues faced by CS students, there is a lack of studies exploring this topic. To investigate the build issues frequently encountered by CS students and explore methods to help them resolve these issues, we conducted a novel dual-phase study involving 330 build tasks among 55 CS students. Phase I characterized the build issues students faced, their resolution attempts, and the effectiveness of those attempts. Based on these findings, Phase II introduced an intervention method that emphasized key information (e.g., recommended programming language versions) to students. The study demonstrated the effectiveness of our intervention in improving build success rates. Our research will shed light on future directions in related areas, such as CS education on best practices for software builds
Incremental and parallel builds performed by build tools such as Make are the heart of modern C/C++ software projects. Their correct and efficient execution depends on build scripts. However, build scripts are prone to errors. The most prevalent errors are missing dependencies (MDs) and redundant dependencies (RDs). The state-of-the-art methods for detecting these errors rely on clean builds (i.e., full builds of a subset of software configurations in a clean environment), which is costly and takes up to multiple hours for large-scale projects. To address these challenges, we propose a novel approach called EChecker to detect build dependency errors in the context of incremental builds. The core idea of EChecker is to automatically update actual build dependencies by inferring them from C/C++ pre-processor directives and Makefile changes from new commits, which avoids clean builds when possible. EChecker achieves higher efficiency than the methods that rely on clean builds while maintaining effectiveness. We selected 12 representative projects, with their sizes ranging from small to large, with 240 commits (20 commits for each project), based on which we evaluated the effectiveness
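To make the two error classes in this abstract concrete, the sketch below infers header dependencies from quoted `#include` directives and compares them with the dependencies a build script declares. It is a simplified stand-in, not EChecker itself: the function names are hypothetical, and it ignores transitive includes, conditional compilation, and Makefile parsing.

```python
import re

# Matches quoted includes such as: #include "util.h"  (system <...> includes are skipped)
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)


def inferred_headers(source_text: str) -> set[str]:
    """Headers named by quoted #include directives in one C/C++ source file."""
    return set(INCLUDE_RE.findall(source_text))


def dependency_errors(declared: set[str], source_text: str) -> tuple[set[str], set[str]]:
    """Return (missing, redundant) header dependencies for one build target.

    missing:   included by the source but not declared in the build script (MD)
    redundant: declared in the build script but never included (RD)
    """
    actual = inferred_headers(source_text)
    return actual - declared, declared - actual


source = '#include <stdio.h>\n#include "util.h"\n  # include "log.h"\nint main(void) { return 0; }\n'
missing, redundant = dependency_errors({"util.h", "old.h"}, source)
print(sorted(missing), sorted(redundant))  # ['log.h'] ['old.h']
```

Because the check only reads source text, it can be rerun on the files touched by a new commit without performing a clean build, which is the efficiency argument the abstract makes.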
Open source C code underpins society's computing infrastructure. Decades of work have helped harden C code against attackers, but C projects do not consist of only C code. C projects also contain build system code for automating development tasks like compilation, testing, and packaging. These build systems are critical to software supply chain security and vulnerable to being poisoned, with the XZ Utils and SolarWinds attacks being recent examples. Existing techniques try to harden software supply chains by verifying software dependencies, but such methods ignore the build system itself. Similarly, classic software security checkers only analyze and monitor program code, not build system code. Moreover, poisoned build systems can easily circumvent tools for detecting program code vulnerabilities by disabling such checks. We present development phase isolation, a novel strategy for hardening build systems against poisoning by modeling the information and behavior permissions of build automation as if it were program code. We have prototyped this approach as a tool called Foreman, which successfully detects and warns about the poisoned test files involved in the XZ Utils attack. We ou
In this paper we present attestable builds, a new paradigm to provide strong source-to-binary correspondence in software artifacts. We tackle the challenge of opaque build pipelines that disconnect the trust between source code, which can be understood and audited, and the final binary artifact which is difficult to inspect. Our system uses modern trusted execution environments (TEEs) and sandboxed build containers to provide strong guarantees that a given artifact was correctly built from a specific source code snapshot. As such it complements existing approaches like reproducible builds which typically require time-intensive modifications to existing build configurations and dependencies, and require independent parties to continuously build and verify artifacts. In comparison, an attestable build requires only minimal changes to an existing project, and offers nearly instantaneous verification of the correspondence between a given binary and the source code and build pipeline used to construct it. We evaluate it by building open-source software libraries - focusing on projects which are important to the trust chain and have proven difficult to be built deterministically. The ove
In modern software engineering, build systems play the crucial role of facilitating the conversion of source code into software artifacts. Recent research has explored high-level causes of build failures, but has largely overlooked the structural properties of build files. Akin to source code, build systems face technical debt challenges that hinder maintenance and optimization. While refactoring is often seen as a key tool for addressing technical debt in source code, there is a significant research gap regarding the specific refactoring changes developers apply to build code and whether these refactorings effectively address technical debt. In this paper, we address this gap by examining refactorings applied to build scripts in open-source projects, covering the widely used build systems of Gradle, Ant, and Maven. Additionally, we investigate whether these refactorings are used to tackle technical debts in build systems. Our analysis was conducted on \totalCommits examined build-file-related commits. We identified \totalRefactoringCategories build-related refactorings, which we divided into \totalCategories main categories. These refactorings are organized into the first empirica
Despite the indisputable benefits of Continuous Integration (CI) pipelines (or builds), CI still presents significant challenges regarding long durations, failures, and flakiness. Prior studies addressed CI challenges in isolation, yet these issues are interrelated and require a holistic approach for effective optimization. To bridge this gap, this paper proposes a novel idea of developing Digital Twins (DTs) of build processes to enable global and continuous improvement. To support such an idea, we introduce the CI Build process Digital Twin (CBDT) framework as a minimum viable product. This framework offers digital shadowing functionalities, including real-time build data acquisition and continuous monitoring of build process performance metrics. Furthermore, we discuss guidelines and challenges in the practical implementation of CBDTs, including (1) modeling different aspects of the build process using Machine Learning, (2) exploring what-if scenarios based on historical patterns, and (3) implementing prescriptive services such as automated failure and performance repair to continuously improve build processes.
Building Android applications reliably remains a persistent challenge due to complex dependencies, diverse configurations, and the rapid evolution of the Android ecosystem. This study conducts an empirical analysis of 200 open-source Android projects written in Java and Kotlin to diagnose and resolve build failures. Through a five-phase process encompassing data collection, build execution, failure classification, repair strategy design, and LLM-assisted evaluation, we identified four primary types of build errors: environment issues, dependency and Gradle task errors, configuration problems, and syntax/API incompatibilities. Among the 135 projects that initially failed to build, our diagnostic and repair strategy enabled developers to resolve 102 cases (75.56%), significantly reducing troubleshooting effort. We further examined the potential of Large Language Models, such as GPT-5, to assist in error diagnosis, achieving a 53.3% success rate in suggesting viable fixes. An analysis of project attributes revealed that build success is influenced by programming language, project age, and app size. These findings provide practical insights into improving Android build reliability and
The rapid adoption of AI coding agents for software development has raised important questions about the quality and maintainability of the code they produce. While prior studies have examined AI-generated source code, the impact of AI coding agents on build systems, a critical yet understudied component of the software lifecycle, remains largely unexplored. This data mining challenge focuses on AIDev, the first large-scale, openly available dataset capturing agent-authored pull requests (Agentic-PRs) from real-world GitHub repositories. Our paper leverages this dataset to investigate (RQ1) whether AI coding agents generate build code with quality issues (e.g., code smells), (RQ2) to what extent AI agents can eliminate code smells from build code, and (RQ3) to what extent Agentic-PRs are accepted by developers. We identified 364 maintainability and security-related build smells across varying severity levels, indicating that AI-generated build code can introduce quality issues, such as lack of error handling and hardcoded paths or URLs, while also, in some cases, removing existing smells through refactorings (e.g., Pull Up Module and Externalize Properties). Notably, more than 61% of
Developers rely on build systems to generate software from code. At a minimum, a build system should produce build targets from a clean copy of the code. However, developers rarely work from clean checkouts. Instead, they rebuild software repeatedly, sometimes hundreds of times a day. To keep rebuilds fast, build systems run incrementally, executing commands only when built state cannot be reused. Existing tools like make present users with a tradeoff. Simple build specifications are easy to write, but limit incremental work. More complex build specifications produce faster incremental builds, but writing them is labor-intensive and error-prone. This work shows that no such tradeoff is necessary; build specifications can be both simple and fast. We introduce LaForge, a novel build tool that eliminates the need to specify dependencies or incremental build steps. LaForge builds are easy to specify; developers write a simple script that runs a full build. Even a single command like gcc src/*.c will suffice. LaForge traces the execution of the build and generates a transcript in the TraceIR language. On later builds, LaForge evaluates the TraceIR transcript to detect changes and perfor
Google has a monolithic codebase with tens of millions of build targets. Each build target specifies the information that is needed to build a software artifact or run tests. It is common to execute a subset of build targets at each revision to make sure that the change does not break the codebase. Google's build service system uses Bazel to build targets. Bazel takes as input a build that specifies the execution context, flags, and build targets to run. The outputs are the built libraries, binaries, or test results. To support developers' daily activities, the build service system runs millions of builds per day. It is a known issue that a build with many targets could run out of the allocated memory or exceed its execution deadline. This is problematic because it reduces developer productivity, affecting, e.g., code submissions or binary releases. In this paper, we propose a technique that predicts the memory usage and executor occupancy of a build. The technique batches a set of targets such that the build created with those targets does not run out of memory or exceed its deadline. This approach significantly reduces the number of builds that run out of memory or exceed the de
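The batching idea in this abstract can be illustrated with a simple greedy packing over predicted memory. This is only a sketch under assumptions, not the technique the paper proposes: it substitutes a first-fit-decreasing heuristic, assumes per-target predictions can be summed, ignores executor occupancy and deadlines, and uses hypothetical target names and a hypothetical `batch_targets` function.

```python
def batch_targets(predicted_mem: dict[str, float], mem_limit: float) -> list[list[str]]:
    """Greedy first-fit-decreasing batching under a memory budget.

    predicted_mem maps a target name to its predicted memory use. Summing
    per-target predictions to get a batch total is a simplifying assumption.
    """
    batches: list[list[str]] = []
    loads: list[float] = []
    # Place the largest targets first so that small ones fill the gaps.
    for target, mem in sorted(predicted_mem.items(), key=lambda kv: kv[1], reverse=True):
        if mem > mem_limit:
            raise ValueError(f"{target} alone exceeds the memory limit")
        for i, load in enumerate(loads):
            if load + mem <= mem_limit:
                batches[i].append(target)
                loads[i] += mem
                break
        else:
            # No existing batch has room, so open a new one.
            batches.append([target])
            loads.append(mem)
    return batches


result = batch_targets(
    {"//a:lib": 6.0, "//b:bin": 5.0, "//c:test": 4.0, "//d:test": 3.0},
    mem_limit=10.0,
)
print(result)  # [['//a:lib', '//c:test'], ['//b:bin', '//d:test']]
```

Each resulting batch would be submitted as one build whose predicted memory stays within the limit, at the cost of running more, smaller builds.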
Build systems are an essential part of modern software engineering projects. As software projects change continuously, it is crucial to understand how the build system changes because neglecting its maintenance can lead to expensive build breakage. Recent studies have investigated the (co-)evolution of build configurations and reasons for build breakage, but they did this only on a coarse grained level. In this paper, we present BUILDDIFF, an approach to extract detailed build changes from MAVEN build files and classify them into 95 change types. In a manual evaluation of 400 build changing commits, we show that BUILDDIFF can extract and classify build changes with an average precision and recall of 0.96 and 0.98, respectively. We then present two studies using the build changes extracted from 30 open source Java projects to study the frequency and time of build changes. The results show that the top 10 most frequent change types account for 73% of the build changes. Among them, changes to version numbers and changes to dependencies of the projects occur most frequently. Furthermore, our results show that build changes occur frequently around releases. With these results, we provid
Accurate building energy forecasting is essential, yet traditional heuristics often lack precision, while advanced models can be opaque and struggle with generalization by neglecting physical principles. This paper introduces BuildEvo, a novel framework that uses Large Language Models (LLMs) to automatically design effective and interpretable energy prediction heuristics. Within an evolutionary process, BuildEvo guides LLMs to construct and enhance heuristics by systematically incorporating physical insights from building characteristics and operational data (e.g., from the Building Data Genome Project 2). Evaluations show BuildEvo achieves state-of-the-art performance on benchmarks, offering improved generalization and transparent prediction logic. This work advances the automated design of robust, physically grounded heuristics, promoting trustworthy models for complex energy systems.