Voting: Let Multiple Answers Calibrate Each Other

The same model can answer the same question differently. A travel assistant may schedule the tea museum for the afternoon in one run and Longjing Village in the next.

If the answer is short and easy to normalize, a simple stabilizer is to sample several times and vote.

One Sentence

Voting turns one sample into multiple samples plus normalization plus winner selection, trading cost for lower variance.
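The three stages in that sentence can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation behind `self_consistency`; the `vote` function and the `sample` callable are hypothetical names introduced here.

```python
from collections import Counter
from typing import Callable


def vote(sample: Callable[[], str], n: int, normalize: Callable[[str], str]) -> str:
    """Sample n times, normalize each reply, and return the most common key."""
    if n < 1:
        raise ValueError("n must be at least 1")
    raw = [sample() for _ in range(n)]      # 1. multiple samples
    keys = [normalize(r) for r in raw]      # 2. normalization
    # 3. winner selection; on a tie, Counter returns the first-seen key
    return Counter(keys).most_common(1)[0][0]


# Stand-in for a model call: replays canned replies (hypothetical).
replies = iter(["A", " B", "A ", "A", "B\n"])
print(vote(lambda: next(replies), n=5, normalize=str.strip))  # A
```

Each extra sample is one more model call, which is the cost side of the trade.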

What Breaks Without It

| Problem | What it looks like | Risk |
| --- | --- | --- |
| One sample carries randomness | Sometimes great | Same task becomes unstable |
| First candidate wins | Fast | Majority answer may be missed |
| Long prose is voted directly | Sounds democratic | There is no comparable answer key |

What This Pattern Changes

| Who | Owns |
| --- | --- |
| Model | Generates multiple candidates |
| Normalizer | Converts candidates into comparable keys |
| Voter / judge | Chooses the winner |
| Python | Controls sample count and traces candidates |
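To make the normalizer's job concrete, here is one possible sketch for an A-or-B task. The function name `normalize_choice` and its rules are assumptions for illustration; the example later on this page only strips whitespace.

```python
import re
from typing import Optional


def normalize_choice(raw: str) -> Optional[str]:
    """Map a free-form reply to the key "A" or "B", or None if it is unclear."""
    text = raw.strip()
    # A bare single-letter reply is accepted in either case.
    if text.upper() in {"A", "B"}:
        return text.upper()
    # Otherwise look for a standalone uppercase A or B. Lowercase is skipped
    # on purpose, because "a" is also an ordinary English word.
    found = set(re.findall(r"\b[AB]\b", text))
    return found.pop() if len(found) == 1 else None


print(normalize_choice("Answer: A."))           # A
print(normalize_choice(" b "))                  # B
print(normalize_choice("Either A or B works"))  # None
```

Returning `None` for ambiguous replies keeps them out of the tally instead of letting them count as a vote.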

Walk Through One Trace

The example task is small: Choose A or B.

| Sample | Raw output | Normalized |
| --- | --- | --- |
| 1 | A | A |
| 2 | B | B |
| 3 | A | A |
| 4 | A | A |
| 5 | B | B |

A wins with three of the five votes; B gets two.
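The tally for this trace can be reproduced with `collections.Counter`. The variable names are illustrative only.

```python
from collections import Counter

raw_outputs = ["A", "B", "A", "A", "B"]

# Normalize by stripping whitespace, then count each key.
tally = Counter(s.strip() for s in raw_outputs)
winner, votes = tally.most_common(1)[0]

print(dict(tally))  # {'A': 3, 'B': 2}
print(f"{winner} wins with {votes}/{len(raw_outputs)} votes")  # A wins with 3/5 votes
```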

Flow

flowchart TD
  P["Same prompt"] --> S1["Candidate 1"]
  P --> S2["Candidate 2"]
  P --> S3["Candidate 3"]
  S1 --> N["normalize"]
  S2 --> N
  S3 --> N
  N --> V["vote / judge"]
  V --> O["winner"]

Code Walk

model = MockLLM(["A", "B", "A", "A", "B"])
messages = [Message(role="user", content="Choose A or B.")]
out = self_consistency(model, messages, n=5, normalize=lambda s: s.strip(), tracer=tracer)

Full example:

from __future__ import annotations

from pathlib import Path

from agent_patterns_lab.patterns.voting import self_consistency
from agent_patterns_lab.runtime import Message, MockLLM, Tracer


def main() -> None:
    tracer = Tracer()
    model = MockLLM(["A", "B", "A", "A", "B"])

    messages = [Message(role="user", content="Choose A or B.")]
    out = self_consistency(model, messages, n=5, normalize=lambda s: s.strip(), tracer=tracer)
    print(out)

    trace_path = tracer.export_jsonl(Path(".traces") / "31_voting.jsonl")
    print(f"[trace] {trace_path}")


if __name__ == "__main__":
    main()

Run:

UV_CACHE_DIR=.uv_cache PYTHONPATH=src uv run --no-sync python examples/31_voting.py

Nearby Patterns

| Pattern | Who decides next | Use when |
| --- | --- | --- |
| Voting | Candidates vote | Answers are short and normalizable |
| Maker-Checker | Checker asks for revision | Drafts need rubric feedback |
| CoVe | Claims are verified | Correctness depends on evidence |
| LATS | Search expands and scores candidates | Candidates need multi-step exploration |

When To Use It

  • The answer can be normalized into a short key.
  • Extra samples are affordable.
  • You need lower randomness, not external facts.
  • Failures are occasional variance, not systematic ignorance.
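The last bullet can be checked with a little arithmetic. The sketch below assumes an idealized setting that this page does not claim: exactly two possible answers, and samples that are independent with the same per-sample accuracy `p`. Real samples from one model are usually correlated, so treat the numbers as an optimistic illustration.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct."""
    need = n // 2 + 1  # smallest count that forms a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Mostly right model: voting over 5 samples helps.
print(round(majority_correct(0.7, 5), 3))  # 0.837
# Mostly wrong model: voting makes the result worse than a single sample.
print(round(majority_correct(0.4, 5), 3))  # 0.317
```

Under these assumptions, voting amplifies whichever answer the model already favors, which is why it reduces variance but cannot repair systematic error.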

When Not To Use It

  • Facts need retrieval or tools.
  • Long-form answers cannot be compared.
  • The model is systematically biased.
  • Latency and cost are tight.

Costs And Common Failures

| Failure | Symptom | Fix |
| --- | --- | --- |
| No majority | A/B/C are tied | Add a judge or fall back to Maker-Checker |
| Cannot normalize | Every long answer differs | Use structured output to extract a key |
| Systematic bias | Majority is still wrong | Add retrieval, tools, or CoVe |
| High cost | n=5 means five calls | Route only hard samples to voting |
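The "No majority" row implies the voter should detect a tie rather than silently pick one candidate. A minimal sketch, assuming a hypothetical helper `winner_or_none` that the caller checks before handing off to a judge:

```python
from collections import Counter
from typing import Optional


def winner_or_none(keys: list[str]) -> Optional[str]:
    """Return the unique top key, or None when the top count is shared."""
    ranked = Counter(keys).most_common(2)  # top two (key, count) pairs
    if not ranked:
        return None  # nothing to vote on
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the caller should escalate to a judge or fallback
    return ranked[0][0]


print(winner_or_none(["A", "B", "A"]))  # A
print(winner_or_none(["A", "B", "C"]))  # None
```

A plain `most_common(1)` would return the first-seen key on a tie, which hides the problem from the trace.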

Voting handles randomness, not missing facts.

For factual checks, read CoVe. For revision from feedback, read Maker-Checker.
