Evaluation Report

Constrained Code Generation

A comparative study of constraint-guided vs. unconstrained LLM code generation.

Abstract

We present an empirical evaluation of Ananke, a constraint-guided code generation system that enforces structural, syntactic, and semantic constraints during LLM inference. Using a benchmark of 15 programming tasks spanning 12 categories, we compare constrained generation against unconstrained baseline generation using the same underlying language model.

Our evaluation measures functional correctness via unit tests (following the pass@k methodology from HumanEval), code quality metrics, constraint adherence, and generation efficiency. Results show a +14.5-point average improvement in overall quality score, with constrained generation winning on 15/15 tasks (100%).

Summary Statistics

  • Constrained wins: 15/15
  • Avg. quality delta: +14.5
  • Constrained pass@1: 33%
  • Baseline pass@1: 20%

1. Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, unconstrained generation often produces code that, while syntactically correct, may not adhere to project-specific patterns, naming conventions, or structural requirements. This limitation becomes critical in enterprise environments where code must conform to established style guides, API contracts, and security policies.

Ananke addresses this challenge through constrained decoding, a technique that modifies the token probability distribution during generation to ensure outputs satisfy specified constraints. Unlike post-hoc validation or prompt engineering approaches, constrained decoding guarantees that generated code structurally conforms to requirements from the first token.
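
As a minimal sketch of how constrained decoding works at the logit level (illustrative Python, not Ananke's implementation; the set of allowed token IDs would come from a constraint engine such as llguidance advanced to the current position):

```python
import torch

def constrained_next_token(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    """Select the next token after masking everything the constraint forbids.

    `logits` is the model's next-token distribution; `allowed_ids` is the set
    of token IDs the constraint automaton permits at the current position.
    """
    mask = torch.full_like(logits, float("-inf"))
    mask[torch.tensor(allowed_ids)] = 0.0
    # Disallowed tokens receive -inf logits (probability zero); greedy argmax
    # matches the temperature-0.0 setting used in this evaluation.
    return int(torch.argmax(logits + mask))
```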

1.1 Research Questions

  1. RQ1 (Correctness): Does constrained generation maintain or improve functional correctness compared to unconstrained generation?
  2. RQ2 (Quality): How does constraint enforcement affect code quality metrics including readability, complexity, and security?
  3. RQ3 (Efficiency): What is the computational overhead of constraint enforcement during generation?

2. Methodology

2.1 Benchmark Design

Our benchmark consists of 15 programming tasks designed to evaluate code generation across diverse domains. Following best practices from HumanEval and MBPP, each task includes the following components; a sketch of one task specification appears after the list:

  • A natural language description of the required functionality
  • Explicit requirements specifying function signatures, behavior, and constraints
  • A comprehensive unit test suite for functional correctness verification
  • Reference implementation for quality comparison
  • Constraint specification in Ananke format (regex patterns, type constraints, structural requirements)
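
For concreteness, a single task specification might look like the sketch below (Python; all field names and paths are hypothetical, as the exact Ananke benchmark schema is not reproduced here):

```python
# Hypothetical task record mirroring the five components listed above.
task = {
    "id": "algo_001_binary_search",  # task ID as in Section 4.2
    "category": "algorithms",
    "description": "Return the index of `target` in a sorted array, or -1.",
    "requirements": {
        "function_name": "binarySearch",
        # Regex enforced at generation time (see Section 3.3):
        "signature_regex": r"export function binarySearch\(arr: number\[\], target: number\): number",
    },
    "tests": "tests/algo_001.test.ts",  # unit test suite (hypothetical path)
    "reference": "refs/algo_001.ts",    # reference implementation (hypothetical path)
    "constraints": "constraints/algo_001.ananke",  # constraint spec (hypothetical path)
}
```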

2.2 Task Categories

| Category | Tasks | Description |
|---|---|---|
| algorithms | 3 | Classic algorithms (sorting, searching, graph traversal) |
| api | 1 | API request handling, validation, and response formatting |
| concurrency | 1 | Rate limiting, async operations, synchronization |
| data_processing | 2 | General programming tasks |
| data_structures | 1 | General programming tasks |
| database | 1 | Query building, connection handling |
| file_io | 1 | General programming tasks |
| mathematics | 1 | General programming tasks |
| security | 1 | Input sanitization, validation, secure coding |
| string_processing | 1 | General programming tasks |
| system_utilities | 1 | General programming tasks |
| web_components | 1 | General programming tasks |

2.3 Evaluation Metrics

| Metric | Weight | Methodology |
|---|---|---|
| Functional Correctness | Primary | Pass rate on unit test suite (pass@1 equivalent) |
| Constraint Adherence | 25% | Verification of export statements, type annotations, naming conventions |
| Pattern Conformity | 25% | Sliding-window similarity to reference implementation structure |
| Code Quality | 25% | Readability (line length, nesting), complexity, conciseness |
| Security | 25% | Detection of dangerous patterns (eval, raw SQL), presence of input validation |
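
Assuming the four 25%-weighted metrics are combined as a plain weighted sum (each scored 0-100, with functional correctness reported separately via pass@1), the overall quality score reduces to the sketch below; the aggregation rule is an assumption, not a documented harness detail:

```python
WEIGHTS = {
    "constraint_adherence": 0.25,
    "pattern_conformity": 0.25,
    "code_quality": 0.25,
    "security": 0.25,
}

def quality_score(metrics: dict[str, float]) -> float:
    """Weighted sum of the four quality metrics, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

# e.g. quality_score({"constraint_adherence": 100, "pattern_conformity": 85,
#                     "code_quality": 90, "security": 88}) -> 90.75
```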

3. Experimental Setup

3.1 Configuration

| Parameter | Value |
|---|---|
| Model | Qwen/Qwen2.5-Coder-7B-Instruct (via vLLM) |
| Hardware | NVIDIA H100 x1 (Modal) |
| Max Tokens | 4096 |
| Temperature | 0.0 |
| Constraint Backend | llguidance |
| Constraint Integration | Braid → vLLM regex-guided decoding |

3.2 Ananke Pipeline

The constrained generation pipeline consists of three stages; a sketch of the hand-off between them follows the list:

  1. Clew (Extraction): Parse source code to extract structural patterns and constraints
  2. Braid (Compilation): Compile constraints into llguidance-compatible format
  3. Maze (Generation): Generate code with constraints enforced via vLLM structured outputs
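
The hand-off can be illustrated with a toy example: a structural pattern of the kind Clew might extract is compiled into a decoding regex of the kind Braid hands to llguidance. Every name and field below is hypothetical:

```python
import re

# A structural pattern as Clew might extract it from project code
# (hypothetical schema).
pattern = {
    "export": True,
    "name": "parseConfig",
    "params": [("path", "string")],
    "returns": "Config",
}

def to_decoding_regex(p: dict) -> str:
    """Compile a signature pattern into a regex usable for guided decoding."""
    params = ", ".join(f"{name}: {ty}" for name, ty in p["params"])
    prefix = "export " if p["export"] else ""
    return rf"{prefix}function {p['name']}\({re.escape(params)}\): {p['returns']} \{{[\s\S]*\}}"

print(to_decoding_regex(pattern))
# -> export function parseConfig\(path: string\): Config \{[\s\S]*\}
```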

3.3 Constraint Enforcement vs. Scoring

Important distinction: This evaluation uses a two-tier constraint system. Generation-time enforcement (regex) guarantees function signature correctness via vLLM's regex-guided decoding. Post-generation scoring evaluates rich constraints (type_constraints, naming_constraints, etc.) for quality metrics.
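
To illustrate the generation-time tier, a regex-guided request against a vLLM OpenAI-compatible server could look like the following (endpoint, prompt, and signature regex are illustrative assumptions; vLLM exposes guided decoding through `extra_body` parameters such as `guided_regex`):

```python
from openai import OpenAI

# Assumes a local vLLM server running the Section 3.1 configuration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Enforce an exported, typed TypeScript signature; the body is left free-form.
SIGNATURE_REGEX = r"export function isPrime\(n: number\): boolean \{[\s\S]*\}"

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": "Implement isPrime in TypeScript."}],
    temperature=0.0,
    max_tokens=4096,
    extra_body={"guided_regex": SIGNATURE_REGEX},  # vLLM regex-guided decoding
)
print(resp.choices[0].message.content)
```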

4. Results

4.1 pass@k Metrics (HumanEval-style)

Following the methodology from Chen et al. (2021), we report pass@k, which measures the probability that at least one of k samples passes all tests.
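
For reference, the standard unbiased estimator from Chen et al. (2021), computed per task from n samples of which c pass:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n-c, k) / C(n, k), evaluated as a numerically stable product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With one sample per task (n=1, k=1), pass@1 reduces to the raw pass rate:
# 5/15 ≈ 33.3% (constrained) vs 3/15 = 20.0% (baseline).
```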

| Metric | Constrained | Baseline | Delta |
|---|---|---|---|
| Correct Tasks | 5 | 3 | +2 |
| pass@1 | 33.3% | 20.0% | +13.3% |

4.2 Per-Task Results

| Task | Category | Difficulty | Baseline | Constrained | Delta |
|---|---|---|---|---|---|
| algo_001_binary_search | algorithms | simple | 75.0 | 90.0 | +15.0 |
| algo_002_merge_sort | algorithms | moderate | 75.0 | 90.0 | +15.0 |
| algo_003_graph_dfs | algorithms | medium | 72.2 | 85.8 | +13.6 |
| api_001_request_validator | api | simple | 65.8 | 84.2 | +18.3 |
| async_001_rate_limiter | concurrency | medium | 74.9 | 90.6 | +15.7 |
| data_001_csv_parser | data_processing | simple | 75.6 | 91.8 | +16.2 |
| data_002_json_validator | data_processing | medium | 67.2 | 83.4 | +16.2 |
| db_001_query_builder | database | medium | 68.6 | 83.5 | +14.9 |
| ds_001_lru_cache | data_structures | medium | 73.7 | 83.5 | +9.8 |
| fileio_001_log_analyzer | file_io | medium | 70.5 | 88.2 | +17.8 |
| math_001_prime_generator | mathematics | simple | 75.0 | 90.0 | +15.0 |
| security_001_input_sanitizer | security | moderate | 72.6 | 88.2 | +15.7 |
| string_001_url_parser | string_processing | simple | 75.0 | 79.3 | +4.3 |
| system_001_config_parser | system_utilities | simple | 76.8 | 91.8 | +15.0 |
| web_001_form_validator | web_components | simple | 72.6 | 87.8 | +15.2 |

4.3 Timing Analysis

| Phase | Baseline (ms) | Constrained (ms) | Notes |
|---|---|---|---|
| Constraint Compilation | N/A | 4 | Braid compilation (Ananke only) |
| LLM Generation | 25,158 | 4,869 | vLLM inference time |
| Test Execution | 4,600 | 3,885 | Running unit tests |

Constraint Compilation Overhead: The Ananke constraint compilation (Braid → llguidance) adds approximately 4 ms per task. This one-time cost is amortized across multiple generations that share the same constraints, and the compiled result can be cached for production use.
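
A minimal caching sketch (the compilation body is a stand-in; only the memoization pattern is the point):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def compile_constraints(spec_text: str) -> str:
    """Compile a constraint spec once per unique spec, amortizing the ~4 ms
    Braid -> llguidance compilation across repeated generations."""
    # Stand-in for the real Braid compilation step.
    return f"compiled::{hash(spec_text)}"
```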

5. Analysis

5.1 Key Findings

RQ1 (Correctness): Constrained generation achieves better functional correctness than the baseline (5/15 vs. 3/15 tasks passing all tests), demonstrating that constraint enforcement does not impair the model's ability to generate functionally correct code.

RQ2 (Quality): Constrained generation shows a consistent improvement in quality metrics, with an average delta of +14.5 points (σ=3.4). The improvement is primarily driven by constraint adherence (100% vs 50%), indicating that explicit constraint enforcement effectively guides the model toward desired code patterns.

RQ3 (Efficiency): With a warm model, constrained generation reduces LLM generation time by roughly 81% relative to baseline (25,158 ms → 4,869 ms, about a 5.2× speedup). This is consistent with findings from JSONSchemaBench that constrained decoding can improve throughput by shrinking the token sampling space.

5.2 Constraint Adherence Analysis

The most significant improvement from constrained generation is in constraint adherence (100% vs 50%). This includes the following, with example checks sketched after the list:

  • Export statements: Constrained generation always produces properly exported functions
  • Type annotations: TypeScript type signatures are consistently correct
  • Naming conventions: Function and variable names match requirements
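
A sketch of such checks over the generated TypeScript source (illustrative regex heuristics, not Ananke's actual scorer):

```python
import re

def adherence_checks(ts_code: str, required_name: str) -> dict[str, bool]:
    """Check the three adherence criteria listed above on generated code."""
    sig = re.search(rf"function\s+{required_name}\s*\(([^)]*)\)", ts_code)
    return {
        # Export statements: the required function must be exported.
        "exported": bool(re.search(
            rf"export\s+(?:default\s+)?function\s+{required_name}\b", ts_code)),
        # Type annotations: every parameter carries a TypeScript type.
        "typed_params": bool(sig) and all(
            ":" in p for p in sig.group(1).split(",") if p.strip()),
        # Naming conventions: camelCase function names.
        "camel_case": bool(re.fullmatch(r"[a-z][A-Za-z0-9]*", required_name)),
    }
```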

5.3 Limitations

  • Sample size: Evaluation covers 15 tasks; larger benchmarks would increase statistical power
  • Single model: Results are specific to Qwen2.5-Coder-7B; larger models may show different patterns
  • pass@1 only: We measure single-sample correctness; pass@k with k>1 would provide additional insight
  • TypeScript only: Evaluation focuses on TypeScript; multi-language evaluation is future work

6. Conclusion

This evaluation demonstrates that Ananke's constrained code generation approach provides measurable benefits over unconstrained generation. Constrained generation wins on 15/15 tasks with an average quality improvement of +14.5 points.

The primary benefit of constraint enforcement is ensuring structural compliance with project requirements, as evidenced by the 100% constraint adherence rate. This is particularly valuable in enterprise environments where code must conform to established patterns and conventions.

Future work will expand the benchmark to include additional languages, implement pass@k evaluation with multiple samples per task, and evaluate against larger language models.

References
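
  1. Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
  2. Austin, J., et al. (2021). Program Synthesis with Large Language Models. arXiv:2108.07732.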