Evaluation Report

Constrained Code Generation

A comparative study of constraint-guided vs. unconstrained LLM code generation.

Abstract

We present an empirical evaluation of Ananke, a constraint-guided code generation system that enforces structural, syntactic, and semantic constraints during LLM inference. Using a benchmark of 15 programming tasks spanning 12 categories, we compare constrained generation against unconstrained baseline generation using the same underlying language model.

Our evaluation measures functional correctness via unit tests (following the pass@k methodology from HumanEval), code quality metrics, constraint adherence, and generation efficiency. Results show a +14.5-point average improvement in overall quality score, with constrained generation winning on 15/15 tasks (100%).

Summary Statistics

  • Constrained wins: 15/15
  • Avg. quality delta: +14.5
  • Constrained pass@1: 33%
  • Baseline pass@1: 20%

1. Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, unconstrained generation often produces code that, while syntactically correct, may not adhere to project-specific patterns, naming conventions, or structural requirements. This limitation becomes critical in enterprise environments where code must conform to established style guides, API contracts, and security policies.

Ananke addresses this challenge through constrained decoding, a technique that modifies the token probability distribution during generation to ensure outputs satisfy specified constraints. Unlike post-hoc validation or prompt engineering approaches, constrained decoding guarantees that generated code structurally conforms to requirements from the first token.
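
As a minimal sketch of how constrained decoding works at the logit level (illustrative Python, not Ananke's implementation; the set of allowed token IDs would come from a constraint engine such as llguidance advanced to the current position):

```python
import torch

def constrained_next_token(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    """Select the next token after masking everything the constraint forbids.

    `logits` is the model's next-token distribution; `allowed_ids` is the set
    of token IDs the constraint automaton permits at the current position.
    """
    mask = torch.full_like(logits, float("-inf"))
    mask[torch.tensor(allowed_ids)] = 0.0
    # Disallowed tokens receive -inf logits (probability zero); greedy argmax
    # matches the temperature-0.0 setting used in this evaluation.
    return int(torch.argmax(logits + mask))
```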

1.1 Research Questions

  1. RQ1 (Correctness): Does constrained generation maintain or improve functional correctness compared to unconstrained generation?
  2. RQ2 (Quality): How does constraint enforcement affect code quality metrics including readability, complexity, and security?
  3. RQ3 (Efficiency): What is the computational overhead of constraint enforcement during generation?

2. Methodology

2.1 Benchmark Design

Our benchmark consists of 15 programming tasks designed to evaluate code generation across diverse domains. Following best practices from HumanEval and MBPP, each task includes the following components; a sketch of one task specification appears after the list:

  • A natural language description of the required functionality
  • Explicit requirements specifying function signatures, behavior, and constraints
  • A comprehensive unit test suite for functional correctness verification
  • Reference implementation for quality comparison
  • Constraint specification in Ananke format (regex patterns, type constraints, structural requirements)
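
For concreteness, a single task specification might look like the sketch below (Python; all field names and paths are hypothetical, as the exact Ananke benchmark schema is not reproduced here):

```python
# Hypothetical task record mirroring the five components listed above.
task = {
    "id": "algo_001_binary_search",  # task ID as in Section 4.2
    "category": "algorithms",
    "description": "Return the index of `target` in a sorted array, or -1.",
    "requirements": {
        "function_name": "binarySearch",
        # Regex enforced at generation time (see Section 3.3):
        "signature_regex": r"export function binarySearch\(arr: number\[\], target: number\): number",
    },
    "tests": "tests/algo_001.test.ts",  # unit test suite (hypothetical path)
    "reference": "refs/algo_001.ts",    # reference implementation (hypothetical path)
    "constraints": "constraints/algo_001.ananke",  # constraint spec (hypothetical path)
}
```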

2.2 Task Categories

| Category | Tasks | Description |
|---|---|---|
| algorithms | 3 | Classic algorithms (sorting, searching, graph traversal) |
| api | 1 | API request handling, validation, and response formatting |
| concurrency | 1 | Rate limiting, async operations, synchronization |
| data_processing | 2 | General programming tasks |
| data_structures | 1 | General programming tasks |
| database | 1 | Query building, connection handling |
| file_io | 1 | General programming tasks |
| mathematics | 1 | General programming tasks |
| security | 1 | Input sanitization, validation, secure coding |
| string_processing | 1 | General programming tasks |
| system_utilities | 1 | General programming tasks |
| web_components | 1 | General programming tasks |

2.3 Evaluation Metrics

| Metric | Weight | Methodology |
|---|---|---|
| Functional Correctness | Primary | Pass rate on unit test suite (pass@1 equivalent) |
| Constraint Adherence | 25% | Verification of export statements, type annotations, naming conventions |
| Pattern Conformity | 25% | Sliding-window similarity to reference implementation structure |
| Code Quality | 25% | Readability (line length, nesting), complexity, conciseness |
| Security | 25% | Detection of dangerous patterns (eval, raw SQL), presence of input validation |
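
Assuming the four 25%-weighted metrics are combined as a plain weighted sum (each scored 0-100, with functional correctness reported separately via pass@1), the overall quality score reduces to the sketch below; the aggregation rule is an assumption, not a documented harness detail:

```python
WEIGHTS = {
    "constraint_adherence": 0.25,
    "pattern_conformity": 0.25,
    "code_quality": 0.25,
    "security": 0.25,
}

def quality_score(metrics: dict[str, float]) -> float:
    """Weighted sum of the four quality metrics, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

# e.g. quality_score({"constraint_adherence": 100, "pattern_conformity": 85,
#                     "code_quality": 90, "security": 88}) -> 90.75
```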

3. Experimental Setup

3.1 Configuration

| Parameter | Value |
|---|---|
| Model | Qwen/Qwen2.5-Coder-7B-Instruct (via vLLM) |
| Hardware | NVIDIA H100 x1 (Modal) |
| Max Tokens | 4096 |
| Temperature | 0.0 |
| Constraint Backend | llguidance |
| Constraint Integration | Braid → vLLM regex-guided decoding |

3.2 Ananke Pipeline

The constrained generation pipeline consists of three stages; a sketch of the hand-off between them follows the list:

  1. Clew (Extraction): Parse source code to extract structural patterns and constraints
  2. Braid (Compilation): Compile constraints into llguidance-compatible format
  3. Maze (Generation): Generate code with constraints enforced via vLLM structured outputs
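
The hand-off can be illustrated with a toy example: a structural pattern of the kind Clew might extract is compiled into a decoding regex of the kind Braid hands to llguidance. Every name and field below is hypothetical:

```python
import re

# A structural pattern as Clew might extract it from project code
# (hypothetical schema).
pattern = {
    "export": True,
    "name": "parseConfig",
    "params": [("path", "string")],
    "returns": "Config",
}

def to_decoding_regex(p: dict) -> str:
    """Compile a signature pattern into a regex usable for guided decoding."""
    params = ", ".join(f"{name}: {ty}" for name, ty in p["params"])
    prefix = "export " if p["export"] else ""
    return rf"{prefix}function {p['name']}\({re.escape(params)}\): {p['returns']} \{{[\s\S]*\}}"

print(to_decoding_regex(pattern))
# -> export function parseConfig\(path: string\): Config \{[\s\S]*\}
```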

3.3 Constraint Enforcement vs. Scoring

Important distinction: This evaluation uses a two-tier constraint system. Generation-time enforcement (regex) guarantees function signature correctness via vLLM's regex-guided decoding. Post-generation scoring evaluates rich constraints (type_constraints, naming_constraints, etc.) for quality metrics.
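
To illustrate the generation-time tier, a regex-guided request against a vLLM OpenAI-compatible server could look like the following (endpoint, prompt, and signature regex are illustrative assumptions; vLLM exposes guided decoding through `extra_body` parameters such as `guided_regex`):

```python
from openai import OpenAI

# Assumes a local vLLM server running the Section 3.1 configuration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Enforce an exported, typed TypeScript signature; the body is left free-form.
SIGNATURE_REGEX = r"export function isPrime\(n: number\): boolean \{[\s\S]*\}"

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": "Implement isPrime in TypeScript."}],
    temperature=0.0,
    max_tokens=4096,
    extra_body={"guided_regex": SIGNATURE_REGEX},  # vLLM regex-guided decoding
)
print(resp.choices[0].message.content)
```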

4. Results

4.1 pass@k Metrics (HumanEval-style)

Following the methodology from Chen et al. (2021), we report pass@k, which measures the probability that at least one of k samples passes all tests.
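
For reference, the standard unbiased estimator from Chen et al. (2021), computed per task from n samples of which c pass:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n-c, k) / C(n, k), evaluated as a numerically stable product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With one sample per task (n=1, k=1), pass@1 reduces to the raw pass rate:
# 5/15 ≈ 33.3% (constrained) vs 3/15 = 20.0% (baseline).
```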

| Metric | Constrained | Baseline | Delta |
|---|---|---|---|
| Correct Tasks | 5 | 3 | +2 |
| pass@1 | 33.3% | 20.0% | +13.3% |

4.2 Per-Task Results

| Task | Category | Difficulty | Baseline | Constrained | Delta |
|---|---|---|---|---|---|
| algo_001_binary_search | algorithms | simple | 75.0 | 90.0 | +15.0 |
| algo_002_merge_sort | algorithms | moderate | 75.0 | 90.0 | +15.0 |
| algo_003_graph_dfs | algorithms | medium | 72.2 | 85.8 | +13.6 |
| api_001_request_validator | api | simple | 65.8 | 84.2 | +18.3 |
| async_001_rate_limiter | concurrency | medium | 74.9 | 90.6 | +15.7 |
| data_001_csv_parser | data_processing | simple | 75.6 | 91.8 | +16.2 |
| data_002_json_validator | data_processing | medium | 67.2 | 83.4 | +16.2 |
| db_001_query_builder | database | medium | 68.6 | 83.5 | +14.9 |
| ds_001_lru_cache | data_structures | medium | 73.7 | 83.5 | +9.8 |
| fileio_001_log_analyzer | file_io | medium | 70.5 | 88.2 | +17.8 |
| math_001_prime_generator | mathematics | simple | 75.0 | 90.0 | +15.0 |
| security_001_input_sanitizer | security | moderate | 72.6 | 88.2 | +15.7 |
| string_001_url_parser | string_processing | simple | 75.0 | 79.3 | +4.3 |
| system_001_config_parser | system_utilities | simple | 76.8 | 91.8 | +15.0 |
| web_001_form_validator | web_components | simple | 72.6 | 87.8 | +15.2 |

4.3 Timing Analysis

| Phase | Baseline (ms) | Constrained (ms) | Notes |
|---|---|---|---|
| Constraint Compilation | N/A | 4 | Braid compilation (Ananke only) |
| LLM Generation | 25,158 | 4,869 | vLLM inference time |
| Test Execution | 4,600 | 3,885 | Running unit tests |

Constraint Compilation Overhead: The Ananke constraint compilation (Braid → llguidance) adds approximately 4 ms per task. This one-time cost is amortized across multiple generations that share the same constraints, and the compiled result can be cached for production use.
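
A minimal caching sketch (the compilation body is a stand-in; only the memoization pattern is the point):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def compile_constraints(spec_text: str) -> str:
    """Compile a constraint spec once per unique spec, amortizing the ~4 ms
    Braid -> llguidance compilation across repeated generations."""
    # Stand-in for the real Braid compilation step.
    return f"compiled::{hash(spec_text)}"
```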

5. Analysis

5.1 Key Findings

RQ1 (Correctness): Constrained generation achieves better functional correctness than the baseline (5/15 vs. 3/15 tasks passing all tests), demonstrating that constraint enforcement does not impair the model's ability to generate functionally correct code.

RQ2 (Quality): Constrained generation shows a consistent improvement in quality metrics, with an average delta of +14.5 points (σ=3.4). The improvement is primarily driven by constraint adherence (100% vs 50%), indicating that explicit constraint enforcement effectively guides the model toward desired code patterns.

RQ3 (Efficiency): With a warm model, constrained generation reduces LLM generation time by roughly 81% relative to baseline (25,158 ms → 4,869 ms, about a 5.2× speedup). This is consistent with findings from JSONSchemaBench that constrained decoding can improve throughput by shrinking the token sampling space.

5.2 Constraint Adherence Analysis

The most significant improvement from constrained generation is in constraint adherence (100% vs 50%). This includes the following, with example checks sketched after the list:

  • Export statements: Constrained generation always produces properly exported functions
  • Type annotations: TypeScript type signatures are consistently correct
  • Naming conventions: Function and variable names match requirements
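
A sketch of such checks over the generated TypeScript source (illustrative regex heuristics, not Ananke's actual scorer):

```python
import re

def adherence_checks(ts_code: str, required_name: str) -> dict[str, bool]:
    """Check the three adherence criteria listed above on generated code."""
    sig = re.search(rf"function\s+{required_name}\s*\(([^)]*)\)", ts_code)
    return {
        # Export statements: the required function must be exported.
        "exported": bool(re.search(
            rf"export\s+(?:default\s+)?function\s+{required_name}\b", ts_code)),
        # Type annotations: every parameter carries a TypeScript type.
        "typed_params": bool(sig) and all(
            ":" in p for p in sig.group(1).split(",") if p.strip()),
        # Naming conventions: camelCase function names.
        "camel_case": bool(re.fullmatch(r"[a-z][A-Za-z0-9]*", required_name)),
    }
```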

5.3 Limitations

  • Sample size: Evaluation covers 15 tasks; larger benchmarks would increase statistical power
  • Single model: Results are specific to Qwen2.5-Coder-7B; larger models may show different patterns
  • pass@1 only: We measure single-sample correctness; pass@k with k>1 would provide additional insight
  • TypeScript only: Evaluation focuses on TypeScript; multi-language evaluation is future work

6. Conclusion

This evaluation demonstrates that Ananke's constrained code generation approach provides measurable benefits over unconstrained generation. Constrained generation wins on 15/15 tasks with an average quality improvement of +14.5 points.

The primary benefit of constraint enforcement is ensuring structural compliance with project requirements, as evidenced by the 100% constraint adherence rate. This is particularly valuable in enterprise environments where code must conform to established patterns and conventions.

Future work will expand the benchmark to include additional languages, implement pass@k evaluation with multiple samples per task, and evaluate against larger language models.

References
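
  1. Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
  2. Austin, J., et al. (2021). Program Synthesis with Large Language Models. arXiv:2108.07732.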