Introduction to Constrained Decoding
[ llm structured-output constrained-decoding ] · 11 min read

While writing the LSP article, I noticed that smaller models repeatedly failed to produce output in the expected structure. They were especially prone to producing malformed JSON that caused parsing errors and downstream failures. I wondered how people deal with situations where a specific structured output is expected from an LLM, since I suspect this is a very common scenario and there must be a proper way to handle it. That question led me down a rabbit hole, and I learned a lot about structured output and, more importantly, about constrained decoding, a technique I found very interesting. This article is an introductory overview of how to make LLMs conform to an expected structured output.

LLM Structured Output

Structured output refers to expecting responses from an LLM that conform to a specific structure (e.g., JSON, XML, SQL) instead of free-form text.

Because LLMs are essentially next-token prediction systems, without any guardrails or mechanisms to ensure structured output, you will find that more often than not they have a hard time generating output that conforms to a predefined structure/schema, especially if the expected output structure is complex.

The problem with malformed output is that it leads to parsing errors, broken workflows, tool call failures, and so on. As more and more complex workflows are built with LLM responses at their core, ensuring properly structured output becomes crucial to avoid cascading failures.

A Few Imperfect Solutions

There have been a few approaches to ensure LLM outputs conform to a predefined structure, with varying degrees of effectiveness:

  1. Few-shot prompt engineering: The idea here is to provide a well-crafted prompt with:

    • Few-shot examples: Show the model 2-3 examples of the desired output
    • Explicit format instructions: “Respond in JSON format with <fields> and <formats>”
    • XML tags for structure: Use <data>, <instructions>, <output> to organize the prompt

    Results: This doesn’t guarantee 100% compliance, but it improves reliability. Overall effectiveness varies depending on:

    • Model: More capable, high-end models perform more reliably than smaller ones.
    • Task complexity: How complex the schema or the overall task is; the nature of the task matters too. Narrative-style tasks, for example, perform worse.
    • Retry: Incorporating a retry mechanism reportedly increases reliability.
  2. Fine-tuning: The idea here is to do post-training on lots of structured output examples, so that the model calibrates itself for generating structured output.

    Results: Reduces prompt tokens but is still not 100% reliable. Plus, fine-tuning adds a cost overhead.

  3. Post-processing/Parsing: The idea is to add retry loops, partial JSON repair, schema validation etc.

    Results: Again, not 100% reliable, and it varies by model and implementation. More importantly, it adds latency and cost, making it suboptimal.

The Instructor Python library combines few-shot prompt engineering with configurable retry loops. It converts Pydantic models to prompt instructions, validates LLM responses against them, and retries on validation failures. This library reportedly achieves very high accuracy with modern models.
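
To give a flavor of this approach, here is a minimal sketch of how Instructor is typically wired up (the exact API surface can differ between versions, and the model name is just a placeholder):

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

# Patch the OpenAI client so responses are parsed and validated against the model
client = instructor.from_openai(OpenAI())

person = client.chat.completions.create(
    model="gpt-4o-mini",    # placeholder model name
    response_model=Person,  # Instructor validates the response against this schema
    max_retries=2,          # re-prompt with the validation error on failure
    messages=[{"role": "user", "content": "Extract: arnab is 30 years old"}],
)
print(person)  # Person(name='arnab', age=30)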

Bottom line: none of the above-mentioned techniques can reliably guarantee 100% compliance; reported effectiveness is in the range of 70-95%.

Constrained Decoding

Constrained Decoding is a technique to manipulate token generation at inference time to guarantee format compliance.

The core idea is to compute which tokens are valid given the current state and required structure. Invalid tokens are masked (i.e., their probability is set to zero or negative infinity) before sampling. This makes it impossible to generate non-compliant output.

Key insight: the output is 100% schema-compliant by design, because tokens that would violate the structure are invalidated during the generation process.

How it works

LLM token prediction refresher

First, let’s review how the token generation process works in an LLM:
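
In short, at every step the model assigns a score (a logit) to each token in its vocabulary, converts those scores into probabilities, samples one token, appends it, and repeats. Here is a toy sketch of that loop, with a made-up vocabulary and random logits standing in for a real model:

import numpy as np

rng = np.random.default_rng(0)
vocab = ["{", "}", '"', "name", "age", ":", ",", "arnab", "30", " "]

def fake_model(generated):
    # Stand-in for a real LLM forward pass: one score (logit) per vocabulary entry
    return rng.normal(size=len(vocab))

def softmax(logits):
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

generated = []
for _ in range(5):
    logits = fake_model(generated)             # 1. score every token in the vocabulary
    probs = softmax(logits)                    # 2. turn scores into probabilities
    next_id = rng.choice(len(vocab), p=probs)  # 3. sample one token
    generated.append(vocab[next_id])           # 4. append it and repeat
print(generated)  # unconstrained output: likely not valid JSON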

Masked Decoding

Constrained decoding applies a binary mask to the output logits, marking which tokens are allowed and which are not. The following image illustrates the masked token generation process:

LLM generation with masked decoding
Figure 1: Text generation with masked decoding

So at a high level, we look at what has been generated so far and what the required structure is, and then we only sample from the tokens that keep the output conforming to that structure.

Constrained Decoding
Figure 2: Structure conformity
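
Concretely, applying the mask means setting the logits of disallowed tokens to negative infinity before the softmax, so their probability becomes zero. A minimal, self-contained sketch:

import numpy as np

vocab = ["0", "1", "2", "a", "b", "}", '"', " "]
logits = np.array([1.2, 0.3, -0.5, 2.0, 1.7, 0.9, 0.1, 0.4])  # raw model scores

# Suppose the structure currently requires a digit: mask everything else out
allowed = [token.isdigit() for token in vocab]
masked_logits = np.where(allowed, logits, -np.inf)

probs = np.exp(masked_logits - masked_logits.max())
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))
# Only "0", "1", "2" keep a non-zero probability; "a" had the highest raw logit
# but can no longer be sampled.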

Implementation approaches

The core part of any constrained decoding implementation is how it tracks the generated state and the required structure to produce the binary mask. There are a few different approaches for this.

Let’s walk through a simplified implementation of how the mask is produced from a predefined schema using a small FSM.

Assume this is our schema with two properties (name and age), one string and one integer:

schema = {
  "type": "object",
  "properties": {
      "name": {"type": "string"},
      "age": {"type": "integer"}
  }
}

So this would be a valid structure conforming to the above schema:

{
    "name": "arnab",
    "age": 30
}

What we want to implement: given the following partial generation, produce a binary mask where only digits are allowed as the next token:

{
    "name": "arnab",
    "age": 

Step 1: Schema to Regex

The first step is to convert the JSON schema into a regex. Here’s a simplified implementation; real projects with complex structures would rely on a library for this (Outlines, for example, converts a JSON schema into a regex and has used interegular to compile that regex into an FSM in the next step):

from typing import Dict

def schema_to_regex(schema: Dict) -> str:
  if schema.get("type") == "object":
      properties = schema.get("properties", {})
      patterns = []
      for key, value_schema in properties.items():
          if value_schema.get("type") == "string":
              patterns.append(f'"{key}"\\s*:\\s*"[^"]*"')
          elif value_schema.get("type") == "integer":
              patterns.append(f'"{key}"\\s*:\\s*\\d+')

      # Build full JSON object pattern
      inner_pattern = "\\s*,\\s*".join(patterns)
      return f"\\{{\\s*{inner_pattern}\\s*\\}}"

  return ""

You can call the schema_to_regex function like this:

schema = {
  "type": "object",
  "properties": {
      "name": {"type": "string"},
      "age": {"type": "integer"}
  }
}

# Step 1: Schema -> Regex
regex_pattern = schema_to_regex(schema)
print(f"Regex pattern: {regex_pattern}")
# It will print the converted Regex pattern like following:
# > Regex pattern: \{\s*"name"\s*:\s*"[^"]*"\s*,\s*"age"\s*:\s*\d+\s*\}
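
As a quick sanity check (using the regex_pattern produced above), we can confirm that the pattern accepts the valid example from earlier and rejects a non-conforming one:

import json
import re

valid = {"name": "arnab", "age": 30}
assert re.fullmatch(regex_pattern, json.dumps(valid)) is not None   # accepted
assert re.fullmatch(regex_pattern, json.dumps({"name": "arnab", "age": "30"})) is None  # age must be an integer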

Step 2: Regex to FSM

The next step is to build an FSM, a state machine that checks the generated text against the compiled regex pattern and returns the valid transitions.

import re
from typing import Set

class SimpleFSM:
  def __init__(self, regex_pattern: str):
      self.pattern = regex_pattern
      self.compiled_regex = re.compile(regex_pattern)

  def get_allowed_next_chars(self, current_text: str) -> Set[str]:
      allowed = set()
      # Test each possible next character
      test_chars = list('abcdefghijklmnopqrstuvwxyz0123456789"{}: ,')
      for char in test_chars:
          test_str = current_text + char
          # Check if this could be a valid partial match
          if self._is_valid_prefix(test_str):
              allowed.add(char)

      return allowed

  def _is_valid_prefix(self, text: str) -> bool:
      # Simple heuristic: check whether this prefix could still lead to a full match
      # Real implementations use DFA state transitions instead
      try:
          pattern = f"^{re.escape(text)}"
          return bool(re.search(pattern, self.pattern)) or bool(
              self.compiled_regex.match(text + "}" * 10)  # simplified completion check
          )
      except re.error:
          return False

fsm = SimpleFSM(regex_pattern)
current_generation = '{"name": "arnab", "age": '
allowed_chars = fsm.get_allowed_next_chars(current_generation)
print(f"Allowed next characters: {allowed_chars}")
# > Allowed next characters: {'4', '1', '9', '7', '3', '8', '2', '5', '6', '0'}

Needless to say, this is very simplified: a real state machine would precompute a transition table (a DFA) instead of re-testing the regex for every candidate character. But it illustrates the point: given the text generated so far, we test each candidate next character and keep only those that can still lead to a valid match (here, only digits).
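
For intuition, here is a purely illustrative, hand-built transition table for the position we are at in the example (after the "age" key and its colon, the pattern effectively requires one or more digits followed by the closing brace). Real libraries derive such tables automatically from the compiled regex:

DIGITS = set("0123456789")

# state -> {transition label -> next state}; hand-written here for illustration only
transitions = {
    "expect_digit": {"digit": "in_number"},            # at least one digit is required
    "in_number": {"digit": "in_number", "}": "done"},  # more digits, or close the object
    "done": {},
}

def allowed_next_chars(state: str) -> set:
    # A table lookup replaces re-running the regex for every candidate character
    allowed = set()
    for label in transitions[state]:
        allowed |= DIGITS if label == "digit" else {label}
    return allowed

print(allowed_next_chars("expect_digit"))  # only digits
print(allowed_next_chars("in_number"))     # digits or '}'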

Step 3: Create binary token mask

Finally, we need to produce a binary mask array marking the allowed tokens. Ideally you would use a real tokenizer's vocabulary, but I will use a sample vocabulary containing lowercase letters, digits, and a few JSON punctuation characters.

Our vocabulary: abcdefghijklmnopqrstuvwxyz0123456789"{}: ,

And the token masker would look like this:

from typing import List

class TokenMasker:
  def __init__(self, fsm: SimpleFSM):
      self.fsm = fsm
      # example vocabulary: [abcdefghijklmnopqrstuvwxyz0123456789"{}: ,]
      self.vocab = [chr(i) for i in range(97, 123)] + [str(i) for i in range(10)] + ['"', '{', '}', ':', ',', ' ']
      self.vocab_size = len(self.vocab)

  def get_token_mask(self, current_text: str) -> List[bool]:
      mask = [False] * self.vocab_size

      # Get allowed next characters from FSM
      allowed_chars = self.fsm.get_allowed_next_chars(current_text)

      # Update Mask for each allowed token in vocabulary
      for token_id in range(self.vocab_size):
          token_str = self.vocab[token_id]
          if token_str and token_str[0] in allowed_chars:
              mask[token_id] = True
      return mask

masker = TokenMasker(fsm)
current_generation = '{"name": "arnab", "age": '
mask = masker.get_token_mask(current_generation)
print(f"Token mask: {mask}")
# You will see only the indices for digits are set to True
# > Token mask: [False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False]
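
Putting the three steps together, a constrained generation loop applies this mask to the model's logits at every step. Here is a rough sketch that continues the example above, with random logits standing in for a real model (a real implementation would operate on tokenizer token IDs rather than single characters):

import numpy as np

rng = np.random.default_rng(0)
generated = '{"name": "arnab", "age": '

# Generate until the FSM's regex fully matches, i.e. the object is closed
while not fsm.compiled_regex.fullmatch(generated):
    logits = rng.normal(size=masker.vocab_size)      # stand-in for a model forward pass
    mask = masker.get_token_mask(generated)
    masked_logits = np.where(mask, logits, -np.inf)  # forbid non-conforming tokens
    probs = np.exp(masked_logits - masked_logits.max())
    probs /= probs.sum()
    next_id = rng.choice(masker.vocab_size, p=probs)
    generated += masker.vocab[next_id]

print(generated)  # a syntactically valid object, e.g. {"name": "arnab", "age": 47}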

For real-world implementations, take a look at the repositories of the libraries listed in the tables below (Outlines, XGrammar, lm-format-enforcer, and others).

Notes on Quality and Performance

While constrained decoding ensures 100% syntactic compliance with the expected structure, it can push the model away from the tokens it would otherwise prefer, potentially reducing semantic quality.

Also, the key order in JSON affects generation quality due to the sequential nature of generation.

On the performance side, deeply nested or recursive structures increase the mask computation overhead.

How do you use it?

Commercial Models with API

If you are using a cloud or enterprise model provider, almost all of them, as of writing this article (Jan 2026), support native constrained decoding for structured output generation, although with varying degrees of support. See the table below for more information:

| Provider | Feature | Guarantee | Notes |
| --- | --- | --- | --- |
| OpenAI | Structured Outputs | 100% schema compliance | strict: true in function definitions; OpenAI claims 100% compliance and reports gpt-4o scores perfect on their internal evals |
| OpenAI | JSON Mode | Valid JSON only | No schema guarantee; deprecated in favor of Structured Outputs |
| Anthropic | Structured Outputs (Beta) | Schema compliance | strict: true + beta header; available for Claude Sonnet 4.5+ |
| Anthropic (via Bedrock) | Tool Use | High reliability (not constrained) | Tool calling uses prompt training, NOT constrained decoding; reliability similar to Instructor approach (95%+ range) |
| Google | Function Calling | Schema compliance | Via Gemini API |
| AWS Bedrock (Nova models only) | Native Constrained Decoding | 100% schema compliance | Automatic grammar generation from tool schemas; guarantees syntactically valid JSON; overall tool use accuracy ~95% due to semantic errors (AWS blog) |
| AWS Bedrock | Custom Model Import | 100% schema compliance | Real-time constrained generation for imported models |
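
For example, with OpenAI's Structured Outputs you attach a JSON schema (with strict mode enabled) to the request, and the response is then guaranteed to parse against it. A rough sketch (the model name is a placeholder and the request shape may evolve):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # placeholder; any model with Structured Outputs support
    messages=[{"role": "user", "content": "Extract: arnab is 30 years old"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,  # enables constrained decoding against the schema
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"name":"arnab","age":30}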

Self hosted or custom inference

If you are hosting models yourself and using an inference engine like vLLM or SGLang, you have a few options for adding constrained decoding to your inference stack.

See the following tables for the major constrained decoding libraries and their integration with inference engines:

Major Libraries and Implementations

| Library | Developer | Key Features | Performance |
| --- | --- | --- | --- |
| Outlines | dottxt | FSM-based, regex/JSON schema, widely integrated | O(1) token lookup, but token-by-token state transitions |
| Guidance | Microsoft | Python DSL for constraints, token-level control, KV-cache optimization | Faster inference on some benchmarks |
| LLGuidance | Microsoft | Rust core for speed, CFG support | ~50us per token for a 128k tokenizer |
| lm-format-enforcer | Noam Gat | JSON Schema + regex, beam search support | Flexible whitespace/ordering |
| XGrammar | MLC/CMU | Context-independent token precomputation, PDA-based | Up to 100x speedup, <40us/token |

Inference Engine Integration

| Engine | Default Backend | Notes |
| --- | --- | --- |
| vLLM | XGrammar (or Outlines, Guidance) | Sequential mask generation |
| SGLang | Compressed FSM | Jump-forward decoding; overlaps mask generation with inference; 2-2.5x faster than alternatives |
| TensorRT-LLM | XGrammar | NVIDIA optimized |
| llama.cpp | GBNF grammars | Native grammar support |
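
As a concrete example of the self-hosted path, vLLM's OpenAI-compatible server accepts a schema through an extra request parameter and enforces it with its structured output backend. A hedged sketch (parameter names have shifted across vLLM versions, and the model name is just a placeholder):

# Assumes a vLLM OpenAI-compatible server is already running, e.g.:
#   vllm serve Qwen/Qwen2.5-7B-Instruct
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "Extract: arnab is 30 years old"}],
    extra_body={"guided_json": schema},  # vLLM-specific structured output parameter
)
print(response.choices[0].message.content)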

Personal Notes