Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search tasks across multiple programming languages. Distinction in task-wise and language-wise adaptation helps explore the sensitivity of code retrieval for syntactical and linguistic variations. To foster research in this area, we make our code and pre-trained models publicly available.
Software engineering activities frequently involve edits to existing code. However, contemporary code language models (LMs) lack the ability to handle diverse types of code-edit requirements. In this work, we attempt to overcome this shortcoming through (1) a novel synthetic data generation pipeline and (2) a robust model adaptation algorithm. Starting with seed code examples and diverse editing criteria, our pipeline generates high-quality samples comprising original and modified code, along with natural language instructions in different styles and verbosity. Today's code LMs come bundled with strong abilities, such as code generation and instruction following, which should not be lost due to fine-tuning. To ensure this, we propose a novel adaptation algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify the weights that are most important for code editing, and (b) does a sparse projection onto the base model to avoid overfitting. Using our approach, we obtain a new series of models NextCoder (adapted from QwenCoder-2.5) that achieves strong results on five code-editing benchmarks, outperforming comparable size models and even several larger ones. We show t
Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face a inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance ma
Open source C code underpins society's computing infrastructure. Decades of work has helped harden C code against attackers, but C projects do not consist of only C code. C projects also contain build system code for automating development tasks like compilation, testing, and packaging. These build systems are critcal to software supply chain security and vulnerable to being poisoned, with the XZ Utils and SolarWinds attacks being recent examples. Existing techniques try to harden software supply chains by verifying software dependencies, but such methods ignore the build system itself. Similarly, classic software security checkers only analyze and monitor program code, not build system code. Moreover, poisoned build systems can easily circumvent tools for detecting program code vulnerabilities by disabling such checks. We present development phase isolation, a novel strategy for hardening build systems against poisoning by modeling the information and behavior permissions of build automation as if it were program code. We have prototyped this approach as a tool called Foreman, which successfully detects and warns about the poisoned test files involved in the XZ Utils attack. We ou
We introduce transversal dimension jump, a code-switching protocol for lifted product (LP) quantum low-density parity-check (qLDPC) codes across different chain-complex dimensions, enabling universal fault-tolerant quantum computation with low overhead. The construction leverages the product structure of LP codes to implement one-way transversal CNOTs between a 3D code and its 2D component codes, enabling teleportation-based switching with geometrically nonlocal gates. Combined with constant-depth CCZ gates in 3D LP codes and low-overhead transversal Clifford gates in 2D LP codes, this yields universal, high-rate quantum logical computation with high thresholds and low space-time costs. Beyond asymptotic schemes, we identify explicit 3D-2D LP code pairs supporting cup-product CCZ gates, including bivariate tricycle-bicycle families such as the $[[81,3,5]]$-$[[54,2,6]]$ pair, where the 3D tricycle codes admit depth-2 CCZ, weight-6 stabilizers, and pseudo-thresholds $\gtrsim 0.4\%$. As a byproduct, we show that the 3D codes enable highly efficient magic-state preparation: a single round of stabilizer measurements followed by depth-2 CCZ and postselection produces states with error $&
Large language models (LLMs) have presented outstanding performance in code generation and completion. However, fine-tuning these models on private datasets can raise privacy and proprietary concerns, such as the leakage of sensitive personal information. Differentially private (DP) code generation provides theoretical guarantees for protecting sensitive code by generating synthetic datasets that preserve statistical properties while reducing privacy leakage concerns. However, DP code generation faces significant challenges due to the strict syntactic dependencies and the privacy-utility trade-off. We propose PrivCode, the first DP synthesizer specifically designed for code datasets. It incorporates a two-stage framework to improve both privacy and utility. In the first stage, termed "privacy-sanitizing", PrivCode generates DP-compliant synthetic code by training models using DP-SGD while introducing syntactic information to preserve code structure. The second stage, termed "utility-boosting", fine-tunes a larger pre-trained LLM on the synthetic privacy-free code to mitigate the utility loss caused by DP, enhancing the utility of the generated code. Extensive experiments on four LL
Tanner codes are long error correcting codes obtained from short codes and a graph, with bits on the edges and parity-check constraints from the short codes enforced at the vertices of the graph. Combining good short codes together with a spectral expander graph yields the celebrated expander codes of Sipser and Spielman, which are asymptotically good classical LDPC codes. In this work we apply this prescription to the left-right Cayley complex that lies at the heart of the recent construction of a $c^3$ locally testable code by Dinur et al. Specifically, we view this complex as two graphs that share the same set of edges. By defining a Tanner code on each of those graphs we obtain two classical codes that together define a quantum code. This construction can be seen as a simplified variant of the Panteleev and Kalachev asymptotically good quantum LDPC code, with improved estimates for its minimum distance. This quantum code is closely related to the Dinur et al. code in more than one sense: indeed, we prove a theorem that simultaneously gives a linearly growing minimum distance for the quantum code and recovers the local testability of the Dinur et al. code.
We consider decoding of vertically homogeneous interleaved sum-rank-metric codes with high interleaving order $s$, that are constructed by stacking $s$ codewords of a single constituent code. We propose a Metzner--Kapturowski-like decoding algorithm that can correct errors of sum-rank weight $t <= d-2$, where $d$ is the minimum distance of the code, if the interleaving order $s > t$ and the error matrix fulfills a certain rank condition. The proposed decoding algorithm generalizes the Metzner--Kapturowski(-like) decoders in the Hamming metric and the rank metric and has a computational complexity of $\tilde{O}(\max(n^3, n^2 s))$ operations in $\mathbb{F}_{q^m}$, where $n$ is the length of the code. The scheme performs linear-algebraic operations only and thus works for any interleaved linear sum-rank-metric code. We show how the decoder can be used to decode high-order interleaved codes in the skew metric. Apart from error control, the proposed decoder allows to determine the security level of code-based cryptosystems based on interleaved sum-rank metric codes.
Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. However, existing methods still face significant challenges in achieving high-fidelity document-to-codebase synthesis--such as scientific papers to code--primarily due to a fundamental conflict between information overload and the context bottlenecks of LLMs. In this work, we introduce DeepCode, a fully autonomous framework that fundamentally addresses this challenge through principled information-flow management. By treating repository synthesis as a channel optimization problem, DeepCode seamlessly orchestrates four information operations to maximize task-relevant signals under finite context budgets: source compression via blueprint distillation, structured indexing using stateful code memory, conditional knowledge injection via retrieval-augmented generation, and closed-loop error correction. Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, decisively outperforming leading commercial agents such as Cursor and Claude Code, and crucially, surpassing PhD-lev
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs, which can be challenging to scale due to the dependence on a teacher model and high generation costs. In this paper, we focus on synthesizing code data at scale and propose a \textbf{Case2Code} task by exploiting the expressiveness and correctness of programs. \textbf{Case2Code} is an inductive inference task that aims to infer underlying code implementations by observing input-output examples or program behaviors, By incorporating LLMs to generate program inputs, and executing the program with these inputs to obtain the program outputs, we can synthesize diverse and high-quality \textbf{Case2Code} data at scale for training and evaluating code LLMs. Experimental results show that case-to-code induction is challenging for current representative LLMs if they are untrained. Models trained with \textbf{Case2Code} improve performance not only on distribution case-to-code induction but also on various coding-generation tasks, demonstrating the great potential of large-scale synthetic data and inductive learnin
Watermarking is a technique to help identify the source of data points, which can be used to help prevent the misuse of protected datasets. Existing methods on code watermarking, leveraging the idea from the backdoor research, embed stealthy triggers as watermarks. Despite their high resilience against dilution attacks and backdoor detections, the robustness has not been fully evaluated. To fill this gap, we propose DeCoMa, a dual-channel approach to Detect and purify Code dataset waterMarks. To overcome the high barrier created by the stealthy and hidden nature of code watermarks, DeCoMa leverages dual-channel constraints on code to generalize and map code samples into standardized templates. Subsequently, DeCoMa extracts hidden watermarks by identifying outlier associations between paired elements within the standardized templates. Finally, DeCoMa purifies the watermarked dataset by removing all samples containing the detected watermark, enabling the silent appropriation of protected code. We conduct extensive experiments to evaluate the effectiveness and efficiency of DeCoMa, covering 14 types of code watermarks and 3 representative intelligent code tasks (a total of 14 scenario
While "Intent-oriented programming" (or "Vibe Coding") redefines software engineering, existing code agents remain tethered to static code snapshots. Consequently, they struggle to model the critical information embedded in the temporal evolution of projects, failing to leverage the "reasoning trajectories" implicit in past successful practices. This limitation results in rigid behavioral logic and a lack of autonomous adaptability, ultimately hindering their ability to tackle complex, repository-level problems. To bridge this static-dynamic mismatch, we propose MemCoder, a framework designed to enable continual human-AI co-evolution. MemCoder first structures historical human experience to distill latent intent-to-code mappings from past commits. It then employs a self-refinement mechanism driven by verification feedback to correct agent behavior in real-time. Crucially, an experience self-internalization mechanism is introduced to crystallize human-validated solutions into long-term knowledge, thereby supporting sustained evolution. Experimental results on SWE-bench Verified demonstrate that MemCoder not only achieves State-of-the-Art (SOTA) performance but also delivers a 9.4% i
Code generation aims to understand the problem description and generate corresponding code snippets, where existing works generally decompose such complex tasks into intermediate steps by prompting strategies, such as Chain-of-Thought and its variants. While these studies have achieved some success, their effectiveness is highly dependent on the capabilities of advanced Large Language Models (LLMs) such as GPT-4, particularly in terms of API calls, which significantly limits their practical applicability. Consequently, how to enhance the code generation capabilities of small and medium-scale code LLMs without significantly increasing training costs is an appealing challenge. In this paper, we suggest that code comments are the natural logic pivot between natural language and code language and propose using comments to boost the code generation ability of code LLMs. Concretely, we propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy. Experiments are performed on HumanEval and MBPP, utilizing StarCoder and WizardCoder as backbone models, and encompassing model parameter sizes between
Private information retrieval (PIR) schemes (with or without colluding servers) have been proposed for realistic coded distributed data storage systems. Star product PIR schemes with colluding servers for general coded distributed storage system were constructed over general finite fields by R. Freij-Hollanti, O. W. Gnilke, C. Hollanti and A. Karpuk in 2017. These star product PIR schemes with colluding servers are suitable for the storage of files over small fields and can be constructed for coded distributed storage system with large number of servers. In this paper for an efficient storage code, the problem to find good retrieval codes is considered. In general if the storage code is a binary Reed-Muller code the retrieval code needs not to be a binary Reed-Muller code in general. It is proved that when the storage code contains some special codewords, nonzero retrieval rate star product PIR schemes with colluding servers can only protect against small number of colluding servers. We also give examples to show that when the storage code is a good cyclic code, the best choice of the retrieval code is not cyclic in general. Therefore in the design of star product PIR schemes with
Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained Language Models handle code-switched text in three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on the abilities of these models to generalise representations to CS corpora. We release all our code and data including the novel corpus at https://github.com/francesita/code-mixed-prob