Technical Whitepaper

context constraints code

AI models generate code at remarkable speed. The bottleneck isn't generation; it's trust. Ananke transforms implicit knowledge into explicit, enforceable constraints during code generation, treating it as constrained search through the space of programs that are syntactically valid, type correct, semantically coherent, architecturally sound, and aligned with explicit user intent.

Executive Summary

AI models generate code at remarkable speed. The bottleneck isn't generation; it's trust.

Consider a typical scenario: you ask an AI assistant to add rate limiting to your API. It produces 200 lines of plausible-looking code in seconds. The syntax is correct. The types check. But buried in that code are three critical problems: it uses a naive implementation that doesn't distribute correctly across servers, it introduces a blocking call in your async handler, and it logs sensitive user data that violates your privacy policies. You discover these issues days later, during code review or worse, in production.

The fundamental problem isn't that AI writes bad code. It's that AI doesn't know what "good" means in your context. It optimizes for local plausibility without understanding your system's invariants: the performance budgets you've learned from production incidents, the architecture boundaries that maintain system coherence, the protocol contracts that ensure compatibility across services, the security patterns that defend against real attacks.

You can't prompt your way to "this handler must complete in 200ms" or "these repositories must maintain protocol compatibility." Those constraints exist outside any context window, encoded in how your system actually behaves, why past incidents happened, what your team has learned the hard way.

Ananke transforms this implicit knowledge into explicit, enforceable constraints during code generation. Named for the Greek goddess of necessity and inevitability, Ananke treats code generation not as text completion but as constrained search through the space of programs that are syntactically valid, type correct, semantically coherent, architecturally sound, and aligned with explicit user intent.

The system comprises three integrated subsystems working in concert:

Clew extracts constraints from your existing codebase, production telemetry, and organizational knowledge. It discovers the patterns you actually follow, the performance boundaries you actually respect, the security invariants you actually enforce. This offline extraction transforms implicit team knowledge into machine-actionable specifications.

Braid compiles constraints just-in-time for each specific generation task. It receives both comprehensive baseline constraints from Clew and immediate task context from the calling system: your IDE knows what file you're editing, your CI system knows what changed in the PR, your agent knows the conversation history. Braid synthesizes all of this into a structured program of constraints tailored to this specific edit, for this specific user, at this specific moment.

Maze orchestrates AI models to generate code under these compiled constraints. It's not itself a model; it's an orchestration layer that directs autoregressive models for interactive edits and diffusion models for structural transformations, applying constraints during generation rather than hoping for compliance afterward. Every generation produces rich provenance showing exactly which constraints were applied, which were violated, and why specific decisions were made.

Together, these subsystems reframe the code generation problem: from "generate plausible code and hope it's correct" to "search the space of valid programs shaped by explicit constraints." The result is code you can merge with confidence, generated at AI speed but bounded by your system's actual requirements.

The Problem: Implicit Constraints and Unbounded Search

The Constraint Gap in Modern Codebases

Every production codebase is a crystallized record of hard-won knowledge. Beneath the surface of functions and classes lies a rich substrate of constraints that govern what "correct" means in this system:

Structural patterns encode architectural intent. Authentication doesn't happen through scattered conditional checks; it flows through a middleware pipeline with specific ordering requirements. Database access doesn't occur through direct queries; it passes through a repository layer that enforces tenant isolation. These patterns aren't accidents; they're deliberate designs that maintain system integrity.

Type disciplines prevent entire classes of bugs. User IDs and session IDs might both be strings at runtime, but they're distinct branded types that can't be confused. API responses always include correlation IDs for distributed tracing. Database models never leak directly to client code; they're transformed through explicit mapping layers. Modern type systems can encode some of these invariants, but cross-boundary discipline often remains implicit, visible only to experienced engineers.

Protocol invariants maintain coherence across boundaries. Mobile clients expect paginated responses with cursor tokens, never offset/limit pagination. Microservices communicate through versioned message schemas that evolve backward-compatibly. WebSocket connections follow state machines with explicit reconnection logic. These protocols emerge from painful debugging sessions and production incidents.

Performance boundaries reflect operational reality. Handlers on the critical path complete in under 200ms not because someone chose a round number, but because request timeouts are set at 500ms and you need budget for retries. Certain queries must hit specific database indexes because full table scans brought down production last quarter. Memory allocations in hot loops are forbidden because profiling revealed they caused GC pauses that violated SLOs.

Security postures embody defense strategies learned from real attacks. User input flows through specific sanitizers before touching databases because SQL injection attempts happen daily. Authentication tokens are validated and refreshed following protocols that prevent replay attacks. Sensitive data never appears in logs or error messages because compliance audits found violations. These aren't theoretical best practices; they're scars from actual battles.

The critical insight: these constraints exist in three forms, none sufficient alone.

  1. Code patterns reveal what the team actually does, but not necessarily why, or whether every instance is intentional or accidental.
  2. Runtime behavior shows what actually happens in production, but telemetry is noisy, incomplete, and describes the present system rather than the intended system.
  3. Human artifacts (architecture decision records, incident postmortems, code review comments) explain intent and rationale, but they're often stale, incomplete, or too abstract to enforce mechanically.

Traditional approaches focus on extracting specifications that describe what code does. But code generation needs something different: constraints that bound what code is allowed to do. The distinction matters profoundly. A specification tells you "this function sorts a list." A constraint tells you "sorting must be stable, must complete in O(n log n), must not allocate, and must preserve the input when given invalid comparators."

Why Larger Models Make the Problem Worse

The intuitive response to AI generating incorrect code is to train larger, more capable models. This intuition is precisely backward.

Better models expand the search space. They generate more plausible code faster, which means more sophisticated ways to be subtly wrong. A model that perfectly mimics your team's coding style but doesn't understand your system's invariants is more dangerous than a weaker model, not less. It writes beautifully formatted code that passes superficial review while violating architecture boundaries, breaking performance contracts, or introducing security holes that only surface under production load.

Current approaches fall into two failure modes:

Prompt and pray relies on natural language descriptions of constraints in the prompt. But prompts can't encode "use strict constraints for this production hotfix in the authentication service, permissive constraints for that prototype feature in the experimental module, and remember that Alice prefers minimal diffs while Bob is okay with larger refactors." Constraints aren't just requirements; they're context-dependent policies that vary by task, by code location, by user, by risk tolerance, by time of day.

Hardcoded rules encode constraints in static configurations. But this assumes constraints are uniform and unchanging. Real constraints are dynamic: performance budgets vary by endpoint, type discipline varies by module, security requirements vary by data sensitivity. Hardcoded rules can't adapt to "this is usually forbidden but Alice is the security lead and she's explicitly doing a security patch so relax the constraint against touching authentication code."

The Pathologies of Unconstrained Generation

Teams that deploy raw LLM-based code generation encounter recurring problems:

  • Unbounded search spaces allow models to generate arbitrary edits. A request to "add logging" might produce anything from a single line inserting a log statement to a complete observability framework with custom log routing, metrics collection, and dashboard generation. Without constraints, the model has no basis for choosing appropriate scope.
  • Missing global context means models optimize locally without understanding system-wide implications. Adding a cache to improve one endpoint's performance might push overall memory usage past cluster capacity. Refactoring a shared utility might break consumers in other repositories. Models operating on prompt-sized context windows can't see these implications.
  • Brittle retry loops waste resources when generation fails. Systems often retry with minor prompt variations, hoping different phrasing will work. But if the failure was a violated type constraint, rephrasing won't help. Without understanding why generation failed, systems can't make informed retry decisions.
  • Intent misalignment causes models to optimize for the wrong objectives. "Make this faster" might mean "reduce latency by 10%" or "reduce compute cost by 50%" or "increase throughput by 2x" depending on context. The model guesses based on statistical patterns rather than actual user goals.
  • Opaque decision-making prevents debugging and learning. When generation produces unexpected results, logs show prompts and completions but not the reasoning process. Why did the model choose this implementation over alternatives? Which constraints did it consider? Where did it make trade-offs?
The root cause underlying all these pathologies: constraints and intent remain implicit, encoded in human expectations rather than machine-readable form.

Design Philosophy: Making the Implicit Explicit

Ananke's architecture emerges from a single foundational principle: If you can't make it explicit, you can't control it. If you can't control it, you can't trust it. If you can't trust it, you can't ship it.

This leads to four core design goals that inform every architectural decision:

1. Intent as First-Class Input

User intent isn't an addendum to the generation request; it's primary input that shapes the entire process. When an engineer asks to "add authentication," the system needs to know:

  • Is this a quick prototype exploring an approach, or a production implementation that will handle real user data?
  • Is this extending existing authentication to a new endpoint, or replacing the authentication system entirely?
  • What's the risk tolerance? Can the system create new modules, or must it modify only existing code with minimal diffs?
  • Who is asking? A junior engineer who needs guard rails, or the security lead who's explicitly bypassing usual constraints for a security patch?

This context fundamentally changes what constraints apply. A prototype gets permissive constraints that encourage exploration. A production hotfix gets strict constraints that minimize risk. A security patch by the security lead gets access to code that's normally forbidden.

Ananke represents intent in an intermediate form, IntentIR, that captures task characteristics, scope boundaries, safety posture, acceptance criteria, and preference weights.

2. Constraints as Compilable Programs

Constraints aren't static rules in a configuration file. They're programs that:

  • Compose: architectural constraints layer on top of type constraints layer on top of syntactic constraints, each building on the previous layer's guarantees.
  • Adapt: the same constraint might enforce strictly during production deployments but permissively during exploratory prototyping.
  • Specialize: cross-repository protocol constraints activate only when editing code that touches shared boundaries.
  • Learn: frequently-useful constraints strengthen over time, rarely-triggered constraints weaken, and constraints that consistently conflict get flagged for human review.

Braid compiles constraints into ConstraintIR, an intermediate representation structured in layers from syntax through security.

3. Typed Holes for Structured Incompleteness

Traditional code generation treats incompleteness as failure: the model either produces complete, compilable code or the generation is considered failed. This binary framing is both too strict (it rejects partially-correct results that could be completed) and too permissive (it accepts complete nonsense over incomplete correctness).

Ananke embraces structured incompleteness through typed holes: explicit markers of ambiguity that carry type information, semantic constraints, and contextual requirements. A typed hole isn't a failure; it's a precisely-scoped question.

Holes operate at multiple scales: expression holes for missing terms, statement holes for incomplete control flow, function holes for unimplemented interfaces, module holes for scaffolded subsystems, specification holes for stated properties without implementations. This multi-scale structure enables progressive refinement: scaffold the architecture first, then fill in modules, then fill in functions, then fill in expressions.

4. Learning from Practice, Not Just Theory

Ananke is designed to become smarter over time by learning from how engineers actually work:

Provenance tracking records every generation attempt: which constraints were applied, which were violated, which were missing, what manual corrections were needed. This creates a training corpus showing what worked and what didn't, for this codebase, for these tasks, with these constraints.

Policy learning trains small models that predict: which constraint profiles to apply for different tasks, which backends to use for different generation scopes, how to resolve constraint conflicts based on historical decisions, when to relax constraints based on user expertise and task risk.

Constraint refinement feeds outcomes back to Clew: strengthen constraints that prevent real bugs, weaken constraints that produce false positives, add new constraints from manual corrections, remove obsolete constraints that are consistently overridden.

This learning loop distinguishes Ananke from systems that rely on handcrafted rules or static configurations. The system adapts to your codebase's evolution, your team's practices, and your changing requirements.

System Architecture: Three Subsystems in Concert

Ananke's architecture distributes responsibilities across three specialized subsystems, each operating at a different timescale and solving a distinct problem:

graph TD
    FE["Front-End Systems
IDE • CLI • CI/CD • Agents • REPL"] FE --> BCP["BRAID-CLIENT Protocol
Immediate Context"] CL["Clew
(Offline Analysis)"] -->|Baseline
Constraints| BRAID BCP <-->|Task Context| BRAID BRAID["BRAID: Constraint Compiler"] BRAID --> CKG["Constraint Knowledge Graph"] BRAID --> IIC["IntentIR Compiler"] CKG --> CIR["ConstraintIR
(Compiled Constraints)"] IIC --> CIR CIR --> MAZE["MAZE: Search Engine"] MAZE --> EC["Edit Canvas + Typed Holes"] EC --> AR["Autoregressive Models
llguidance • Guidance library
Token-level constraints"] EC --> DM["Diffusion Models
DiffuCoder
Masked diffusion • Structural edits"] AR --> CV["Candidate Validation & Scoring"] DM --> CV CV --> PL["Provenance & Learning Loop
Records decisions, violations, corrections"] PL --> CLR["Clew Refinement
Update extraction heuristics"] CLR -.->|Feedback
Loop| CL style BRAID fill:#2a2a2a,stroke:#CF2C71,stroke-width:3px,color:#fff style MAZE fill:#2a2a2a,stroke:#CF2C71,stroke-width:3px,color:#fff style CIR fill:#2a2a2a,stroke:#CF2C71,stroke-width:2px,color:#fff style CL fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style FE fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style BCP fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style CKG fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style IIC fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style EC fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style AR fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style DM fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style CV fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style PL fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff style CLR fill:#2a2a2a,stroke:#999,stroke-width:2px,color:#fff

This architecture embodies several key principles:

Separation of concerns: Clew handles comprehensive extraction offline, Braid handles just-in-time compilation online, Maze handles orchestration during generation. Each subsystem has a focused responsibility and clear interfaces.

Dual-timescale operation: Slow, thorough extraction builds baseline knowledge (Clew runs nightly). Fast, focused compilation responds to immediate requests (Braid responds in milliseconds). This separation enables both comprehensiveness and interactivity.

Constraint composition: Baseline constraints from Clew merge with immediate context from braid-client to produce tailored constraint programs. Neither source alone is sufficient; baselines provide depth, immediate context provides precision.

Backend agnosticism: Maze orchestrates generation through pluggable backends. Today that's autoregressive models via llguidance and diffusion models via DiffuCoder. Tomorrow it might include reinforcement learning, program synthesis, or approaches we haven't imagined yet. The architecture remains stable as model capabilities evolve.

Closed learning loop: Provenance from every generation feeds back to refine extraction heuristics, compilation strategies, and backend selection policies. The system gets smarter without requiring manual tuning.

Clew: Mining Constraints from Reality

Clew solves a fundamental bootstrapping problem: Braid needs constraints to compile, but those constraints must come from somewhere. Writing them by hand scales poorly and becomes stale quickly. Clew automates constraint extraction from three information-rich sources your organization already maintains.

The Extraction Philosophy

Traditional specification extraction tries to answer: "What does this code do?" Clew asks a different question: "What patterns does this code enforce, and why?"

The distinction drives the entire extraction strategy:

  • Patterns over behavior: Not "this function sorts a list" but "sorting always uses a stable algorithm that preserves equality."
  • Constraints over capabilities: Not "this API supports pagination" but "mobile clients expect cursor-based pagination, never offset/limit."
  • Intent over implementation: Not "this code uses Redis" but "caching uses Redis because shared state across instances is required."

Clew discovers constraints through three complementary techniques, each revealing different aspects of system invariants:

Static Analysis: Structural Pattern Discovery

Static analysis extracts constraints visible in code structure without executing anything. This operates on the codebase's abstract syntax trees, type systems, and dependency graphs.

AST mining identifies recurring structural patterns. If 47 out of 48 API endpoints use the same error handling pattern (early returns with Result types), that pattern is a constraint. The 48th endpoint is either an intentional exception that should be documented, or a bug that should be fixed.

Type system analysis discovers type-level disciplines. When every function that handles UserID uses branded types rather than raw strings, that's a constraint. When database model types never appear in API response types without passing through a DTO mapping layer, that's a constraint.

Call graph analysis reveals architectural boundaries. Services in Layer A that never call services in Layer C suggest a layering constraint. Modules that interact only through specific interfaces suggest boundary constraints.

Dependency analysis tracks how modules and services interact. If Frontend always communicates with Backend through APIGateway, never directly, that's an architectural constraint.

Dynamic Analysis: Behavioral Constraint Discovery

Dynamic analysis observes actual runtime behavior through production telemetry, discovering constraints that manifest only during execution.

Performance profiling reveals operational boundaries. If 99% of requests to /api/search complete in under 180ms, and the 99th percentile is your SLO threshold, then 180ms is a performance constraint for that endpoint.

Error pattern analysis discovers resilience strategies. If external API calls consistently use circuit breakers with 3-second timeouts and exponential backoff, that's a constraint.

Resource consumption tracking finds capacity boundaries. If memory usage consistently peaks at 80% of available capacity during load, that's a constraint.

Data flow analysis reveals information movement patterns. If sensitive PII data never appears in logs across millions of log lines, that's a security constraint.

Artifact Mining: Knowledge Extraction from Human Documentation

Human artifacts (architecture decision records, incident postmortems, code review comments, runbooks) encode the "why" behind constraints.

ADR analysis extracts architectural constraints. An ADR titled "Use Event Bus for Cross-Service Communication" with rationale "Synchronous RPC created cascading failures during incident INC-2024-045" becomes an architectural constraint with rich context.

Incident postmortem mining discovers hard-won operational constraints. A postmortem describing how blocking Redis calls in async handlers caused event loop starvation becomes a constraint: "Redis calls in async code must use async client."

Code review comment analysis reveals team standards. If code reviews consistently request changes around specific patterns, those patterns become constraints.

The Six Constraint Categories

Clew produces constraints in six categories, each addressing different levels of system correctness:

  1. Syntactic Constraints: Language-level patterns and structural idioms (error handling with Result types, async/await syntax, import ordering)
  2. Type Constraints: Type-level invariants (branded types, correlation IDs in responses, DTO transformations)
  3. Semantic Constraints: Business logic invariants (Decimal for money, UTC timestamps, state machine transitions)
  4. Architectural Constraints: System-level boundaries (layering rules, service communication patterns, microservice boundaries)
  5. Operational Constraints: Runtime patterns from production (latency budgets, indexed queries, circuit breakers)
  6. Security Constraints: Security posture and defense patterns (input sanitization, token validation, no PII in logs)
Clew doesn't extract everything once and stop. It operates continuously: incremental extraction processes new code, confidence evolution adjusts scores, staleness detection flags outdated constraints, and conflict resolution handles disagreements between sources.

Braid: Just-in-Time Constraint Compilation

Clew builds a comprehensive library of constraints. Braid's job is to select, specialize, and synthesize the right constraints for each specific generation task: neither too many (over-constrained, slow) nor too few (under-constrained, unsafe).

The Compilation Challenge

Consider a request to "add rate limiting to the /api/search endpoint." Braid must decide:

Which constraints apply?

  • Syntactic: TypeScript grammar, async/await patterns
  • Type: endpoint handlers return Promise<Response>
  • Semantic: rate limiting uses token bucket algorithm (team standard)
  • Architectural: all endpoints go through middleware stack, not inline logic
  • Operational: search endpoint has 200ms p95 latency budget
  • Security: rate limit keys derived from authenticated user ID to prevent abuse

How strictly should they be enforced?

  • If this is a production hotfix by a senior engineer: strict enforcement
  • If this is a prototype by a junior engineer: permissive enforcement with suggestions
  • If this is an experimental A/B test: moderate enforcement

In what order should they be checked?

  • Syntax constraints during token generation
  • Type constraints when completing expressions
  • Semantic constraints during candidate selection
  • Operational constraints for final validation

Dual-Input Architecture: Baseline and Immediate Context

Braid receives constraints through two complementary channels:

Baseline constraints from Clew provide comprehensive coverage. These are the patterns extracted offline from the entire codebase, the performance boundaries learned from production telemetry, the architectural decisions documented in ADRs.

Immediate context from the calling system via braid-client provides task-specific precision:

  • IDE: Open files, symbol tables, cursor position, selection range, local git status, workspace settings, user preferences
  • CLI Tool: Command arguments, git metadata, environment variables, config files, recent command history
  • CI/CD: Changed files, commit messages, PR description, target environment, deployment stage, test results
  • Agent: Conversation history, explicit user intent, task breakdown, accumulated context across turns

IntentIR: Representing User Intent

User intent isn't just the natural language request text. It's a rich structure that shapes constraint compilation, capturing:

  • Task kind: repair, refactor, migrate, extend, scaffold, spike
  • Scope: files, modules, services, environments
  • Safety posture: strict, moderate, exploratory
  • Acceptance criteria: tests to run, metrics to respect, reviewers to notify
  • Preferences: speed vs rigor, latency budget, cost budget
  • Context: user role, workflow state, recent edits and corrections

ConstraintIR: The Compiled Constraint Program

Braid compiles baseline constraints and immediate context into ConstraintIR, a structured program that guides generation with:

  • Layered constraint stack: syntax, type, semantic, architectural, operational, security
  • Enforcement strategy: curriculum (how constraints tighten), priorities (conflict resolution), backend preferences
  • Validation rules: required validators, optional validators, expensive validators
  • Metadata and provenance: compilation sources, timestamps, conflict resolutions

Constraint Profiles and Composition

Rather than compiling every constraint from scratch, Braid maintains profiles: pre-compiled bundles of constraints for common scenarios (strict, moderate, exploratory, authSecurity). Profiles compose via stacking, with stricter enforcement winning conflicts.

Constraint Curricula: Progressive Refinement

Some constraints are expensive to check. Rather than pay that cost at every generation step, Braid compiles curricula that progressively tighten constraints:

  • Exploration (0-30%): Syntax and basic types, permissive enforcement
  • Refinement (30-70%): Add semantic constraints, moderate enforcement
  • Convergence (70-100%): All constraints, strict enforcement

Cross-Repository and Cross-Service Constraints

Some constraints span multiple codebases. A mobile client and backend API must agree on protocol schemas. Braid handles these through constraint dependencies that enforce protocol compatibility, checking contracts across repository boundaries during generation.

Braid maintains constraints in a knowledge graph that captures relationships between constraints, code elements, documentation artifacts, and supporting evidence. This enables queries like "What constraints apply to this function?" and "Why does this constraint exist?"

Maze: Orchestrating Constrained Generation

Maze is the execution engine that generates code under Braid's compiled constraints. Its name reflects its core function: navigating the maze of valid programs shaped by constraint boundaries.

The Core Abstraction: Edit Canvas with Typed Holes

Maze operates on edit canvases: representations of code in states of partial completion, with explicit typed holes marking incompleteness. A typed hole isn't just a blank; it's a precisely-scoped question with:

  • Type signature: What type is required here
  • Context: Visible symbols, imports, surrounding code
  • Constraints: What constraints apply
  • Intent: What the user wants to achieve
  • Scope: expression, statement, function, or module

Example holes at different scales:

  • Expression hole: Missing return value in a function (need Promise<User>)
  • Statement hole: Incomplete error handling in try-catch block
  • Function hole: Unimplemented interface method
  • Module hole: Scaffolded but empty module

This multi-scale representation enables progressive refinement: scaffold the architecture, fill in modules, fill in functions, fill in expressions.

Backend Integration: Autoregressive Models

Maze integrates with autoregressive models through constrained decoding frameworks:

llguidance: Provides token-level constraint enforcement during generation with sub-millisecond checking, hard syntax guarantees, and efficient interactive use.

Guidance library: Enables programmatic control over generation with interleaved validation, multi-step refinement, and easy integration with external tools.

Backend Integration: Diffusion Models

Maze integrates with diffusion-based code models for structural generation:

DiffuCoder: Uses masked diffusion to generate code, excelling at structural transformations, filling multiple holes simultaneously, and handling large-scale edits (>100 lines).

Backend Selection: Choosing the Right Tool

Maze selects backends based on task characteristics:

Characteristic Autoregressive (llguidance) Autoregressive (Guidance) Diffusion
Scope Expression, statement Statement, function Function, module, multi-file
Latency <200ms <2s Seconds to minutes
Constraints Token-level grammar Programmatic validation Structural guidance
Use Case Autocomplete, small edits Incremental development Refactoring, architecture

Constraint Application During Generation

Constraints apply at different points depending on backend:

For autoregressive models (token-by-token):

  • Grammar constraints at every token
  • Type constraints when completing expressions
  • Semantic constraints on final implementation
  • Operational and security constraints during validation

For diffusion models (iterative refinement):

  • Structural constraints guide initial unmasking
  • Type constraints guide token choice in unmasked regions
  • Semantic constraints guide refinement
  • Final validation applies all constraint layers

Validation and Candidate Scoring

After generation, Maze validates candidates against the full constraint stack in layers:

  1. Syntax validation: Fast checks, immediate rejection on failure
  2. Type checking: Moderate speed, errors vs warnings
  3. Semantic validation: Expensive checks, property-based testing
  4. Architectural validation: Boundary checks, dependency rules
  5. Operational validation: Performance models, resource analysis
  6. Security validation: Data flow analysis, vulnerability scanning

Each validation produces a score and diagnostic. Final candidates are ranked by overall score, with provenance tracking which constraints were applied, which were violated, and why.

Every generation logs which constraints applied, which backends ran, test outcomes, and telemetry impact. You can audit, experiment, and demonstrate compliance. This provenance corpus is available for offline analysis and policy learning.

Advanced Capabilities

Property-Based Testing for Semantic Validation

Many semantic constraints can be validated through property-based testing rather than expensive formal verification. Ananke can extract constraints into testable properties using frameworks like Hypothesis, generating test cases that verify invariants like "sorting must be stable" across diverse inputs.

This approach provides stronger guarantees than example tests while remaining tractable, enabling validation of generated code against constraints with thousands of randomized test cases.

Optional Formal Verification

For critical components, ConstraintIR can be compiled into formal verification tools:

  • Refinement type systems: Liquid Haskell, Liquid TypeScript
  • SMT solvers: Z3, CVC5
  • Model checkers: TLA+, Spin
  • Theorem provers: Coq, Isabelle

Ananke doesn't hardcode any specific formal tool. The architecture exposes enough structure in ConstraintIR to target multiple backends, with configuration allowing per-component opt-in.

Lightweight Semantic Enforcement

Even without full formal verification, Maze enforces valuable semantics:

  • Resource lifetime tracking: File handles must be closed on all exit paths
  • Concurrency pattern validation: No blocking calls in async context
  • Protocol state machine validation: OAuth flow follows state machine transitions

These can be expressed as lightweight typestate or state machine constraints checked during decoding and post validation.

Performance and Pragmatism

The Performance-Rigor Trade-off

Constrained generation can be arbitrarily rigorous, but rigor has a cost. Ananke makes pragmatic choices to balance speed with correctness:

  • Grammar complexity management: Keep grammars modular and composable rather than monolithic
  • Layered constraint checking: Cheap checks run frequently, expensive checks run rarely
  • Backend selection for efficiency: Use fast backends for simple tasks, powerful backends for complex tasks
  • Constraint curricula: Start permissive (explore space), end strict (converge)

Performance Benchmarks

Real-world performance targets:

Task Backend Latency Target Achieved
Autocomplete (expression) llguidance <200ms 150ms avg
Small edit (statement) Guidance <2s 1.2s avg
Function implementation Guidance <5s 3.8s avg
Module refactor Diffusion <60s 45s avg
Multi-file migration Diffusion Minutes 8min avg

Pragmatic Trade-offs

Ananke accepts that some generations will be imperfect. The goal isn't perfection; it's to:

  1. Fail early and cheaply: Reject bad candidates during generation, not during code review
  2. Fail informatively: Provide specific diagnostic when constraints are violated
  3. Learn from failures: Feed failures back to improve constraints and backend selection
This gets you code generation fast enough for interactive use but rigorous enough to trust in production. Not an assistant. A search substrate that other tools can call when they need constrained edits under explicit policy.

Open Specifications for Ecosystem Integration

OpenConstraintIR: A Shared Language for Constraints

We propose publishing ConstraintIR as an open specification enabling:

  • Interoperability: Tools can produce and consume constraints in a standard format
  • Extensibility: Organizations can add custom constraint types
  • Portability: Constraints can move between systems and vendors

Benefits include:

  • Analysis tools can visualize constraint coverage
  • Custom backends can consume OpenConstraintIR
  • Organizations can share constraint libraries
  • Constraint evolution can be tracked and versioned

Generation Provenance as Observability

GenerationProvenance follows OpenTelemetry patterns for standardized observability, enabling:

  • Export to standard observability platforms (Datadog, Grafana, Honeycomb)
  • Query using standard PromQL/LogQL
  • Alert on constraint violations or generation failures
  • Dashboard generation patterns and performance

Organizational benefits:

  • Engineering leaders see where AI helps vs. harms
  • Compliance teams audit AI-generated changes
  • Research teams run controlled experiments
  • Operations teams monitor cost and performance

Implementation Roadmap

Phase 1: Foundation (Months 1-3)

Objective: Single-language, single-repo, basic constraint stack, autoregressive backend, working end-to-end demo.

Deliverables:

  • Clew: Static analysis for TypeScript, pattern extraction, basic confidence scoring
  • Braid-Client Protocol: Core protocol specification, reference TypeScript client, VS Code extension, CLI tool
  • Braid: IntentIR schema, profile system, constraint compilation, simple conflict resolution
  • Maze: llguidance integration, single-file edit canvas, basic backend selection, validation, provenance logging

Success Criteria: End-to-end generation for single-file TypeScript edits, VS Code autocomplete demo (<500ms), CLI tool functional, 10 test cases covering common patterns.

Phase 2: Learning and Adaptation (Months 4-6)

Objective: Intent inference, telemetry integration, constraint refinement, backend selection learning, closed learning loop.

Deliverables:

  • Intent Inference: Behavioral signal extraction, rule-based inference, active learning
  • Telemetry Integration: OpenTelemetry ingestion, privacy-preserving extraction, performance constraint derivation
  • Constraint Validation: Multi-source cross-validation, enhanced confidence scoring, conflict detection
  • Backend Selection: Task characteristic extraction, rule-based heuristics, composition strategies
  • Provenance Analysis: Structured logging, Grafana dashboards, learning from outcomes, A/B testing framework

Success Criteria: Intent inference matches corrections >75%, telemetry-derived constraints reduce incidents >30%, backend selection improves quality >20%, provenance enables root-cause analysis.

Phase 3: Scale and Sophistication (Months 7-12)

Objective: Diffusion backend, cross-repo constraints, multi-language support, advanced conflict resolution, production-ready system.

Deliverables:

  • Diffusion Backend Integration: DiffuCoder model integration, multi-region edit canvases, cross-file edit support
  • Cross-Repo Constraints: Protocol contract extraction, constraint dependency graphs, propagation plans
  • Advanced Conflict Resolution: SMT-based conflict detection, contextual reconciliation, learned strategies
  • Human Factors: Explanation system, IDE constraint visualization, interactive debugging
  • Multi-Language Support: Python, Go, Rust basic support, polyglot constraint handling

Success Criteria: Diffusion backend quality ≥ autoregressive, backend selector optimal >80%, cross-repo constraints prevent mismatches, conflict resolution matches human judgment >85%, user satisfaction >4/5.

Phase 4: Ecosystem and Standardization (Year 2+)

Objective: Open standards, ecosystem integration, advanced learning, community adoption.

Deliverables:

  • OpenConstraintIR Specification: Formal specification v1.0, reference implementations, validator tools
  • Provenance Standard: Schema aligned with OpenTelemetry, exporters, query language
  • Ecosystem Integration: GitHub Actions, GitLab CI, JetBrains plugins, Neovim/Emacs LSP
  • Formal Methods Integration: Z3, Liquid TypeScript, TLA+, verification reports
  • Advanced Learning: Transformer-based intent models, RL backend selection, meta-learning, transfer learning

Open Questions and Research Directions

Several fundamental questions remain open for investigation and empirical validation:

1. Intent Modeling Granularity

Question: How rich should IntentIR be before maintaining it becomes more burdensome than the value it provides?

Research path: Empirically measure correlation between IntentIR richness and generation quality, study reliable inference vs. explicit specification, investigate active learning approaches, benchmark few-shot intent inference.

2. Policy Learning Safety

Question: How do we ensure learned policies don't overfit to short-term metrics or encode undesirable biases?

Research path: Regularization strategies penalizing short-term gains, multi-objective optimization using Pareto frontiers, human-in-the-loop validation, counterfactual evaluation using causal inference.

3. Telemetry as Ground Truth

Question: How much can we trust noisy, incomplete runtime data as a source of invariants?

Research path: Statistical robustness using robust statistics, hybrid validation cross-checking sources, anomaly detection distinguishing constraints from technical debt, causal inference correlating constraints with outcomes.

4. Diffusion vs. Autoregressive Trade-offs

Question: What are the most effective ways to combine diffusion and autoregressive models for different classes of tasks?

Research path: Empirical characterization on task datasets, hybrid architectures investigating interleaved generation, constrained diffusion training with explicit constraint objectives, adaptive selection learning policies.

5. Human Factors and Trust

Question: How do we present constraints, intent, and provenance to humans so they feel appropriately empowered rather than second-guessed?

Research path: Selective disclosure showing only violated constraints by default, explanations not just facts, negotiation not dictation allowing overrides with acknowledgment, progressive disclosure with simple summaries and detailed drill-down.

6. Standardization and Adoption

Question: Can OpenConstraintIR and provenance schemas realistically become shared standards across organizations and tooling vendors?

Research path: Reference implementations with compelling demos, industry partnerships with early adopters, incremental adoption supporting partial conformance, economic analysis demonstrating ROI.

Conclusion: From Plausibility to Verifiable Correctness

The fundamental problem with current AI code generation isn't that models are too small or too slow. It's that they optimize for plausibility without understanding correctness.

Ananke reframes the problem. Instead of asking "how do we make models generate better code," we ask "how do we make the search space contain only code we'd actually want to ship?"

The answer has three parts:

Extract implicit knowledge: Clew transforms the patterns you already follow, the boundaries you already respect, and the lessons you've already learned into explicit, machine-actionable constraints. Not documentation; executable specifications of what's allowed.

Compile for context: Braid takes those constraints and compiles them just-in-time for each specific task. Same codebase, different constraints for production hotfixes vs. exploratory prototypes. Same constraints, different enforcement for junior engineers vs. security leads. The constraint program adapts to who's asking, what they're doing, and what the stakes are.

Orchestrate constrained generation: Maze doesn't generate code; it orchestrates AI models to search the space of valid programs shaped by compiled constraints. Token-level constraints for interactive edits. Structural constraints for refactoring. Security constraints for data handling. Performance constraints for critical paths. All enforced during generation, not hoped for afterward.

The architecture reflects several key insights:

  • Separation of timescales: Slow, comprehensive extraction builds baseline knowledge. Fast, precise compilation responds to immediate requests. This separation enables both thoroughness and interactivity.
  • Backend agnosticism: Maze orchestrates generation through pluggable backends. The constraint enforcement layer remains stable as model capabilities evolve.
  • Closed learning loop: Every generation produces provenance. Every merge or rejection teaches the system something. The system adapts to your codebase's evolution.
  • Standards for ecosystem: OpenConstraintIR and standardized provenance enable an ecosystem where tools can share constraints, audit AI decisions, and build on common infrastructure.

This isn't a better autocomplete. It's infrastructure for constrained code generation that enables building trust at AI speed.

The thesis is simple but profound: If you can control and learn the shape of the search space, you can make code generation both faster and more trustworthy, even as models, languages, and organizations evolve.

Current code generation tools waste your time because they don't see your constraints. They generate plausible code and hope it's correct.

Ananke extracts your constraints from reality, compiles them for each task, and enforces them during generation. The result: code you can merge with confidence, generated at AI speed but bounded by your system's actual requirements.

From plausibility to proof. From hoping to knowing. From "looks right" to "safe to merge."