Reflexion: Store Lessons From Failures

If the travel assistant makes a mistake today, such as treating "easy walking" as "add more attractions", it may repeat that mistake tomorrow.

Maker-Checker can repair one draft. Reflexion goes further: after a failure, write a lesson and use that lesson on the next similar task.

One Sentence

Reflexion turns failure-and-retry into failure → lesson → memory → next attempt, so repeated tasks do not start from zero.

What Breaks Without It

| Problem | What it looks like | Risk |
| --- | --- | --- |
| Each task is isolated | Simple | Same mistake repeats |
| Feedback stays in one run | Can fix once | Forgotten next time |
| Lessons are vague | Sounds reflective | Cannot be retrieved or applied |

What This Pattern Changes

| Who | Owns |
| --- | --- |
| Model | Answers, writes lessons after failure, retries |
| Verifier | Decides pass/fail |
| Memory store | Stores lessons |
| Python | Controls rounds, reads/writes memory, traces |

A lesson should be short, specific, and executable, for example: "State the answer as a single number."

Walk Through One Trace

| Round | Answer | Verification | Lesson / next |
| --- | --- | --- | --- |
| 1 | bad answer | Fails; expected 42 | Lesson: answer only as a number |
| 2 | 42 | Passes | Return |

For travel, a lesson might be: "For easy-walking trips, cap stops at three unless the user asks otherwise."
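A lesson like the travel one is easiest to apply when it maps to a check that code can run. The sketch below shows one way such a check might look. The names `PlanCheck`, `verify_easy_walking`, and `MAX_EASY_WALKING_STOPS` are hypothetical and are not part of `agent_patterns_lab`; representing a plan as a plain list of stop names is also an assumption.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cap taken from the example lesson text.
MAX_EASY_WALKING_STOPS = 3


@dataclass
class PlanCheck:
    """Result of checking a travel plan (illustrative type, not the library's)."""
    ok: bool
    feedback: str = ""


def verify_easy_walking(stops: List[str], user_overrode_cap: bool = False) -> PlanCheck:
    """Pass if the plan respects the stop cap, or if the user asked for more stops."""
    if user_overrode_cap or len(stops) <= MAX_EASY_WALKING_STOPS:
        return PlanCheck(ok=True)
    return PlanCheck(
        ok=False,
        feedback=(
            f"Easy-walking plan has {len(stops)} stops; "
            f"cap is {MAX_EASY_WALKING_STOPS}."
        ),
    )


# A four-stop plan fails the check; the feedback can seed the next lesson.
too_many = verify_easy_walking(["museum", "park", "market", "tower"])
within_cap = verify_easy_walking(["museum", "park"])
```

The feedback string is deliberately concrete, so a lesson written from it can be applied as a check rather than as general advice.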

Flow

flowchart TD
  T["Task"] --> M["Read relevant lessons"]
  M --> A["Generate answer"]
  A --> V["Verify"]
  V -->|pass| O["Final answer"]
  V -->|fail| L["Write lesson"]
  L --> S["Store in memory"]
  S --> M
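The flow above can be sketched as a plain Python loop. This is a minimal illustration under stated assumptions, not the implementation of `reflexion` in `agent_patterns_lab`: the names `Verdict`, `reflexion_loop`, and the stub functions are hypothetical, and keying memory by the exact task string is a simplification of "read relevant lessons".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Verdict:
    """Pass/fail result plus feedback (illustrative type)."""
    ok: bool
    feedback: str = ""


def reflexion_loop(
    task: str,
    answer_fn: Callable[[str, List[str]], str],
    lesson_fn: Callable[[str, str, str], str],
    verify_fn: Callable[[str], Verdict],
    memory: Dict[str, List[str]],
    rounds: int = 2,
) -> str:
    """Run answer -> verify -> lesson -> retry for at most `rounds` attempts."""
    answer = ""
    for _ in range(rounds):
        lessons = memory.get(task, [])        # read relevant lessons
        answer = answer_fn(task, lessons)     # generate answer
        verdict = verify_fn(answer)           # verify
        if verdict.ok:
            return answer                     # final answer
        lesson = lesson_fn(task, answer, verdict.feedback)  # write lesson
        memory.setdefault(task, []).append(lesson)          # store in memory
    return answer  # last attempt, even if it never passed


# Stubs standing in for model calls (assumptions for the demo).
def stub_answer(task: str, lessons: List[str]) -> str:
    return "42" if lessons else "bad answer"


def stub_lesson(task: str, answer: str, feedback: str) -> str:
    return "State the answer as a single number."


def check(answer: str) -> Verdict:
    ok = answer.strip() == "42"
    return Verdict(ok=ok, feedback="" if ok else "Expected exactly: 42")


memory: Dict[str, List[str]] = {}
result = reflexion_loop("What is 6 * 7?", stub_answer, stub_lesson, check, memory)
```

Because `memory` outlives the call, a later run of the same task starts with the stored lesson instead of from zero.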

Code Walk

The example uses an in-memory KV store:

kv = InMemoryKV()

The verifier decides pass/fail:

def verify(answer: str) -> VerificationResult:
    ok = answer.strip() == "42"
    return VerificationResult(ok=ok, feedback="Expected exactly: 42" if not ok else "")

Full example:

from __future__ import annotations

from pathlib import Path

from agent_patterns_lab.patterns.reflexion import VerificationResult, reflexion
from agent_patterns_lab.runtime import InMemoryKV, MockLLM, Tracer


def main() -> None:
    tracer = Tracer()
    kv = InMemoryKV()

    model = MockLLM(
        [
            "bad answer",
            '{"lesson":"State the answer as a single number."}',
            "42",
        ]
    )

    def verify(answer: str) -> VerificationResult:
        ok = answer.strip() == "42"
        return VerificationResult(ok=ok, feedback="Expected exactly: 42" if not ok else "")

    out = reflexion(
        model,
        task="What is 6 * 7?",
        verify=verify,
        memory_get=kv.get,
        memory_set=kv.set,
        tracer=tracer,
        rounds=2,
    )

    print(out)
    trace_path = tracer.export_jsonl(Path(".traces") / "42_reflexion.jsonl")
    print(f"[trace] {trace_path}")


if __name__ == "__main__":
    main()

Run:

UV_CACHE_DIR=.uv_cache PYTHONPATH=src uv run --no-sync python examples/42_reflexion.py

Nearby Patterns

| Pattern | What it does | Use when |
| --- | --- | --- |
| Maker-Checker | Current draft is revised | One run needs feedback |
| Reflexion | Lesson from a failure is stored | Similar tasks repeat |
| Memory | Preferences or facts persist | Long-term context matters |
| CoVe | Claims are verified | Factual correctness matters |

When To Use It

  • Similar tasks repeat.
  • There is a verifier or reliable feedback.
  • Lessons can be written as short checks.
  • You can maintain memory quality.

When Not To Use It

  • The task is one-off.
  • There is no reliable verification signal.
  • Lessons may mislead future tasks.
  • High-risk memory has no review process.

Costs And Common Failures

| Failure | Symptom | Fix |
| --- | --- | --- |
| Vague lesson | "Be careful" | Write concrete checks |
| Memory pollution | Bad lesson repeats | Add review, expiry, namespaces |
| Retrieval miss | Lesson exists but is not used | Add tags/task keys |
| Overfitting to memory | Old lesson distorts new task | Keep verifier in the loop |

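Several of the fixes in the table (namespaces, expiry, tags) can live in the memory store itself. The sketch below is one possible shape, assuming a small in-process store; `Lesson`, `LessonStore`, and their methods are hypothetical names and are not the `InMemoryKV` API. The `now` parameter exists only to make expiry testable without waiting.

```python
import time
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Lesson:
    text: str
    tags: FrozenSet[str]
    expires_at: float  # seconds on the same clock as `now`


class LessonStore:
    """Lessons grouped by namespace, filtered by tag overlap and expiry."""

    def __init__(self) -> None:
        self._by_namespace: Dict[str, List[Lesson]] = {}

    def add(
        self,
        namespace: str,
        text: str,
        tags: Iterable[str],
        ttl_seconds: float,
        now: Optional[float] = None,
    ) -> None:
        now = time.time() if now is None else now
        lesson = Lesson(text, frozenset(tags), now + ttl_seconds)
        self._by_namespace.setdefault(namespace, []).append(lesson)

    def find(
        self,
        namespace: str,
        tags: Iterable[str],
        now: Optional[float] = None,
    ) -> List[str]:
        """Return unexpired lessons in `namespace` sharing at least one tag."""
        now = time.time() if now is None else now
        wanted = frozenset(tags)
        return [
            lesson.text
            for lesson in self._by_namespace.get(namespace, [])
            if lesson.expires_at > now and lesson.tags & wanted
        ]


store = LessonStore()
store.add(
    "travel",
    "For easy-walking trips, cap stops at three unless the user asks otherwise.",
    tags={"easy-walking"},
    ttl_seconds=100,
    now=0,
)
fresh = store.find("travel", {"easy-walking"}, now=50)
expired = store.find("travel", {"easy-walking"}, now=200)
```

Namespaces keep a travel lesson from leaking into unrelated tasks, and expiry limits how long a bad lesson can repeat. Neither replaces the verifier, which still decides pass/fail on every run.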
Reflexion fits systems that learn from verifiable failures.

For one-run revision, read Maker-Checker. For preferences inside the current session, read Conversation History.
