Upon Further Reflection

Reducing Errors Through a Second Pass

The Reflection Service

This post describes the technical implementation of the Reflection service in the Busy Family conversational agent: a second-pass validation layer that catches errors in tool invocations before they reach the user, and when necessary, rolls back actions and retries with corrections.

1. Introduction: Why Reflection?

Agentic systems that take real-world actions—creating, updating, or deleting calendar events—can cause real harm when they get things wrong. Unlike a chatbot response, a misfired calendar action needs to be undone. The user who asked to "move dinner to next Friday" and instead had it moved to Thursday will see the wrong date on their calendar and may miss the event.

LLMs are capable of subtle reasoning failures. Wrong day-of-week for a relative date like "next Friday." A malformed event ID that concatenates family ID, timestamp, and title instead of using the actual Google Calendar identifier. A location string "home" resolved to "Home Depot" instead of the family's address. These mistakes are not obvious bugs; they emerge from the model's inference, and the tools execute them faithfully.

Reflection is the layer added to catch these failures before they reach the user. It is a second LLM call that plays the role of a skeptical reviewer, checking the first model's tool invocations against what the user actually said. When reflection finds an error, the system rolls back the action and retries with a correction message. When the rollback and retry succeed, the user never sees the wrong outcome.

2. What the Reflection Service Validates

Reflection focuses on three categories of errors, in priority order. These are defined in the system prompt (ReflectionSystemPrompt in RemoteModule.kt):

1. Date/time errors (highest priority): Does the resolved date actually match what the user said? For example, the user said "next Friday" but the tool call uses a Thursday. This includes checking whether relative references such as "tomorrow," "this weekend," and named days of the week were resolved correctly. A special case: if the user used relative dates and the agent did not call the calculate_datetime tool, reflection checks whether the outcome is nevertheless correct. If the date is wrong, it flags an error; if the date is right but the process was sloppy, it flags a warning.

2. Event ID errors (second priority): For update_calendar_event, is the eventId in a valid Google Calendar ID format? Valid IDs are typically short alphanumeric strings (e.g., l16venr5bq2eh1cn14f4kjjvlk). Invalid IDs are concatenations containing underscores, dates, or names (e.g., familyID_20251022T190000_Game Night with Family). A malformed ID causes the API call to fail or to update the wrong event.

3. Location errors (third priority): If the user said "home" or "my home," was the family's home address used rather than a business with "Home" in its name (e.g., Home Depot)? The system prompt instructs reflection not to flag location strings that will be resolved at execution time, and to flag only resolved results that are clearly wrong.
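The event ID check in item 2 can be approximated with a simple shape test. The sketch below is a heuristic under stated assumptions, not the service's actual rule: the name looksLikeCalendarEventId and the 5..64 length bounds are hypothetical, and the pattern only encodes the "short lowercase alphanumeric string" shape described above, so it may be too strict for some legitimate IDs.

```kotlin
// Heuristic only: accepts short lowercase alphanumeric strings, the shape the
// post describes as typical. The 5..64 length bounds are an assumption.
val typicalEventId = Regex("^[a-z0-9]{5,64}$")

// Returns true when the ID has the typical shape; underscores, spaces,
// and uppercase letters all cause a rejection.
fun looksLikeCalendarEventId(id: String): Boolean = typicalEventId.matches(id)
```

A check like this is cheap enough to run before the reflection model is called, which would let obviously concatenated IDs be flagged without spending an LLM call.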

Only specific tools trigger reflection. Not every tool call is validated:

private val toolsRequiringReflection = setOf(
    "create_calendar_event",
    "update_calendar_event",
    "bulk_update_calendar_events",
    "delete_calendar_event",
    "get_location_info",
    "get_location_suggestions"
)

These are the tools with real-world side effects or location resolution. Read-only tools are not validated.
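As a rough illustration of how such an allow-list might be applied, the sketch below filters a turn's tool calls down to the ones worth validating. ToolCall and selectForReflection are hypothetical names introduced here; the real call type is not shown in this post.

```kotlin
// Hypothetical minimal shape of a recorded tool call.
data class ToolCall(val name: String, val argumentsJson: String)

val toolsRequiringReflection = setOf(
    "create_calendar_event",
    "update_calendar_event",
    "bulk_update_calendar_events",
    "delete_calendar_event",
    "get_location_info",
    "get_location_suggestions"
)

// Keeps only the calls that should be sent to the reflection model.
// An empty result means reflection can be skipped for this turn.
fun selectForReflection(calls: List<ToolCall>): List<ToolCall> =
    calls.filter { it.name in toolsRequiringReflection }
```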

3. Where Reflection Sits in the Conversation Flow

Reflection runs after tool execution but before streaming a response to the user. At that point, the agent has already called the calendar API. If reflection finds an error, it rolls back the action and retries. This order is intentional: the agent must actually attempt the operation so reflection can see the resolved arguments (the final tool call payload), not just the plan.

The integration point is in ChatRequestHandler after the tool execution loop:

val originalMessage = originalUserMessages[familyID] ?: ""
val currentRetryCount = reflectionRetryAttempts[familyID] ?: 0
if (originalMessage.isNotEmpty() && currentRetryCount == 0) {
    try {
        // ...
        val validationResult = reflectionService.validate(
            userMessage = originalMessage,
            toolCalls = toolCallsForValidation,
            currentDateTime = currentDateTime,
            familyHomeAddress = homeAddress,
            familyTimezone = tz
        )
        // ...
    }
}

The currentRetryCount == 0 guard is important: reflection with rollback fires only on the first attempt per user turn, which prevents infinite retry loops.
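The effect of the guard can be sketched as a small decision function. This is an illustrative reading of the branches shown in this post, not code from ChatRequestHandler; ReflectionPass and choosePass are hypothetical names.

```kotlin
// Hypothetical labels for the three situations the handler can be in.
enum class ReflectionPass { FIRST_PASS_WITH_ROLLBACK, SECOND_PASS_CONFIDENCE_ONLY, SKIPPED }

// Mirrors the guard: rollback is possible only when no retry has happened yet.
// Once a retry has been made, validation still runs but can no longer roll back.
fun choosePass(originalMessage: String, retryCount: Int): ReflectionPass = when {
    originalMessage.isNotEmpty() && retryCount == 0 -> ReflectionPass.FIRST_PASS_WITH_ROLLBACK
    retryCount > 0 -> ReflectionPass.SECOND_PASS_CONFIDENCE_ONLY
    else -> ReflectionPass.SKIPPED
}
```

Because the counter only moves from 0 to 1 within a turn, at most one rollback and one retry can occur per turn.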

4. The Reflection Prompt and LLM Configuration

Reflection uses a dedicated system prompt and a separate Bedrock call, independent of the main agent's conversation. Key configuration choices:

  • Temperature 0.0 — Deterministic output. Validation should not be creative.
  • Max tokens 1000 — Sufficient for a compact JSON error report.
  • Structured JSON output — The reflection model is instructed to return only JSON with no prose, so parsing is predictable.

The model receives: the original user message, the current date/time (in the user's timezone), the family's home address, and the full list of tool calls (name + serialized arguments). The validation prompt is built in ReflectionService:

val prompt = """
    $reflectionPrompt

    Current Date/Time: $currentDateTime
    Family Home Address: ${familyHomeAddress ?: "Not set"}$dateFactsSection

    User Message: "$userMessage"

    Tool Calls Made:
    ${gson.toJson(toolCallsJson)}

    Validate these tool calls against the user's intent. Return ONLY a JSON response with no additional text:
    {
      "valid": true/false,
      "errors": [
        {"type": "date", "issue": "description of what's wrong", "correction": "what should be done instead"},
        {"type": "location", "issue": "description of what's wrong", "correction": "what should be done instead"}
      ],
      "confidence": "high" or "medium" or "low"
    }

    If no errors found, return: {"valid": true, "errors": [], "confidence": "high"}
""".trimIndent()

The model returns JSON like:

{
  "valid": false,
  "errors": [
    {
      "type": "date",
      "severity": "error",
      "issue": "User asked for 'next Friday' but the date 2025-10-23 is actually a Thursday",
      "correction": "Change the date to 2025-10-24 which is the actual Friday"
    }
  ],
  "confidence": "high"
}

5. Errors vs. Warnings: The Severity Model

Not all findings are equal. The reflection service uses a two-tier severity model defined in ReflectionModels.kt:

enum class ValidationSeverity {
    ERROR,
    WARNING
}

  • ERROR: The outcome is wrong. Action: roll back and retry with corrections.
  • WARNING: The process was imperfect but the outcome appears correct. No rollback; the result stands and confidence is downgraded to MEDIUM.

A common example of a warning: the main agent resolved "this Sunday" to the correct date without calling the calculate_datetime tool—a process shortcut that happened to produce the right answer. Reflection flags it as a warning so the behavior can be improved over time, but does not undo a correct result.

The valid field in the result is true if and only if there are no ERROR-severity findings. Warnings alone leave valid = true. The system prompt states:

IMPORTANT: If ALL findings are severity "warning" (no "error" items), set "valid" to true.
Only set "valid" to false when there is at least one "error" severity finding.
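The rule above can also be enforced in code rather than trusted to the model. The following is a minimal sketch, assuming a simplified Finding type and a Verdict summary that are not part of the post; it also assumes that "downgraded to MEDIUM" means a HIGH confidence is lowered to MEDIUM when only warnings are present, while lower values are left unchanged.

```kotlin
enum class ValidationSeverity { ERROR, WARNING }
enum class ConfidenceLevel { HIGH, MEDIUM, LOW }

// Hypothetical, simplified shapes for illustration.
data class Finding(val severity: ValidationSeverity, val issue: String)
data class Verdict(val valid: Boolean, val confidence: ConfidenceLevel, val needsRollback: Boolean)

// valid is true if and only if there is no ERROR-severity finding.
// Warnings alone never trigger rollback; they only cap confidence at MEDIUM.
fun summarize(findings: List<Finding>, modelConfidence: ConfidenceLevel): Verdict {
    val hasError = findings.any { it.severity == ValidationSeverity.ERROR }
    val hasWarning = findings.any { it.severity == ValidationSeverity.WARNING }
    val confidence =
        if (!hasError && hasWarning && modelConfidence == ConfidenceLevel.HIGH) ConfidenceLevel.MEDIUM
        else modelConfidence
    return Verdict(valid = !hasError, confidence = confidence, needsRollback = hasError)
}
```

Deriving valid from the findings, instead of reading the model's own "valid" field, would keep the two from disagreeing.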

6. Programmatic Date Facts: Anchoring the LLM Against Itself

[Figure: Fact Anchoring]

A key design decision: LLMs can be wrong about day-of-week arithmetic. The reflection model itself might make the same kind of reasoning error it is trying to catch. To guard against this, ReflectionService computes date facts programmatically before calling the model, then injects them into the prompt as verified ground truth:

// Extract dates from tool calls and compute day-of-week programmatically.
// Note: the reflection model is fallible; these computed facts are our ground truth.
val zoneId = familyTimezone
    ?.let { tz -> runCatching { ZoneId.of(tz) }.getOrNull() }
    ?: ZoneId.of("America/Los_Angeles")
val dateFacts = extractDateFacts(toolsToValidate, zoneId)

These facts are labeled VERIFIED DATE FACTS (computed programmatically - these are FACTUAL) in the prompt. The system prompt instructs the reflection model not to contradict them. After the model responds, a second filter (filterIncorrectDayOfWeekErrors) removes any LLM errors that contradict the programmatically-verified facts—ensuring the reflection model cannot introduce a false positive by miscalculating a weekday.
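The idea of anchoring on computed facts can be illustrated with java.time. The sketch below is not the service's extractDateFacts: computeDateFacts and contradictsFacts are hypothetical names, the extraction is a plain regex over serialized arguments, and time zone handling is deliberately left out for brevity.

```kotlin
import java.time.DayOfWeek
import java.time.LocalDate

// Matches ISO calendar dates (yyyy-MM-dd), including the date part of a timestamp.
val isoDate = Regex("""(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)""")

// Finds dates in serialized tool arguments and computes each weekday deterministically.
// Strings that look like dates but are not valid (e.g. month 13) are skipped.
fun computeDateFacts(argumentsJson: String): Map<LocalDate, DayOfWeek> =
    isoDate.findAll(argumentsJson)
        .mapNotNull { m -> runCatching { LocalDate.parse(m.value) }.getOrNull() }
        .associateWith { it.dayOfWeek }

// True when a weekday claimed by the model disagrees with the computed fact.
// Dates with no computed fact are never treated as contradictions.
fun contradictsFacts(date: LocalDate, claimed: DayOfWeek, facts: Map<LocalDate, DayOfWeek>): Boolean =
    facts[date]?.let { it != claimed } ?: false
```

A post-response filter in this style would drop any model finding for which contradictsFacts returns true, since the computed weekday cannot be wrong in the way an LLM's arithmetic can.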

Additionally, when the reflection model flags a MISSING_CALCULATE_DATETIME error but the date outcome is verifiably correct, the service downgrades it to a WARNING via downgradeProcessViolationsWhenOutcomeCorrect. The comment explains:

The LLM reflection often flags process violations as errors even when the result is right.

So the system distinguishes between "wrong outcome" (rollback) and "imperfect process, right outcome" (advisory only).
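A downgrade step of this kind might look like the sketch below. It is an assumption-laden illustration, not the body of downgradeProcessViolationsWhenOutcomeCorrect: the Finding shape, the string-typed error type, and the outcomeVerifiedCorrect flag are all hypothetical.

```kotlin
enum class Severity { ERROR, WARNING }

// Hypothetical finding shape; the type string follows the error name used in the post.
data class Finding(val type: String, val severity: Severity)

// When the outcome has been verified correct, process-only violations become warnings.
// All other findings, and all findings when the outcome is unverified, pass through unchanged.
fun downgradeProcessViolations(findings: List<Finding>, outcomeVerifiedCorrect: Boolean): List<Finding> =
    if (!outcomeVerifiedCorrect) findings
    else findings.map { f ->
        if (f.type == "MISSING_CALCULATE_DATETIME" && f.severity == Severity.ERROR)
            f.copy(severity = Severity.WARNING)
        else f
    }
```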

7. Rollback: Undoing Agentic Actions

[Figure: Rollback]

When reflection finds hard errors, the system must undo what the agent already did. Each tool type has its own rollback strategy in ChatRequestHandler:

Created events — Deleted via the calendar API using the event ID returned in the tool result's rich elements:

"create_calendar_event" -> {
    if (execution.result is ToolResult.SuccessWithRichElements) {
        for (elem in execution.result.richElements.filterIsInstance<RichElement.CalendarEvent>()) {
            try {
                println("DEBUG: Reflection rollback - deleting created event: ${elem.id}")
                val deleteResult = calendarEventService.deleteEvent(familyID, elem.id, null)
                // ...
            }
        }
    }
}

Updated events — Restored from a pre-update snapshot taken before the tool was called:

"update_calendar_event" -> {
    if (execution.result is ToolResult.SuccessWithRichElements &&
        argumentsModifySchedule(execution.arguments)) {
        for (elem in execution.result.richElements.filterIsInstance<RichElement.CalendarEvent>()) {
            val snapshot = preUpdateSnapshots[elem.id]
            if (snapshot != null) {
                val restoreResult = calendarEventService.restoreEvent(familyID, snapshot)
                // ...
            } else {
                println("WARN: No snapshot for event ${elem.id}, cannot rollback update")
            }
        }
    }
}

Bulk-updated events — Same snapshot/restore pattern, applied to each event individually.

The snapshot mechanism is critical. Before calling update_calendar_event or bulk_update_calendar_events, the calendar tools invoke ChatRequestHandler.storePreUpdateSnapshot(eventId, snapshot) to store the current state of the event. CalendarEventService.restoreEvent then applies that snapshot to undo the change. This is a full pre-image restore, not an undo log.

After rollback, the system also cleans up conversation state: the failed tool use and tool result messages are removed from the conversation history, and any rich elements (calendar cards) generated by the failed attempt are cleared from cache so the retry starts clean.
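The pre-image pattern can be shown in isolation with an in-memory stand-in for the calendar. Everything in this sketch is hypothetical (Event, InMemoryCalendar, and its methods); it only demonstrates why storing the whole prior state makes rollback independent of what the update changed.

```kotlin
// Hypothetical event shape; immutable, so a stored reference is a safe snapshot.
data class Event(val id: String, val title: String, val start: String)

// In-memory stand-in for the calendar API, for illustration only.
class InMemoryCalendar {
    private val events = mutableMapOf<String, Event>()
    private val preUpdateSnapshots = mutableMapOf<String, Event>()

    fun put(event: Event) { events[event.id] = event }
    fun get(id: String): Event? = events[id]

    // Stores the full pre-image before mutating the event.
    fun update(id: String, change: (Event) -> Event): Boolean {
        val current = events[id] ?: return false
        preUpdateSnapshots[id] = current
        events[id] = change(current)
        return true
    }

    // Restores the stored pre-image; returns false when no snapshot exists,
    // which corresponds to the "cannot rollback update" case.
    fun rollback(id: String): Boolean {
        val snapshot = preUpdateSnapshots.remove(id) ?: return false
        events[id] = snapshot
        return true
    }
}
```

An undo log would instead have to record and invert each field change; the pre-image approach trades a little storage for a much simpler restore path.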

8. What the Retry Receives: The Correction Message

The retry is not a blind re-run. The errors and corrections from reflection are injected into the conversation as a new user-role message before recursively re-invoking the handler:

val correctionMessage = buildString {
    append("CORRECTION NEEDED: Your previous tool invocation had the following errors:\n\n")
    hardErrors.forEach { error ->
        append("- ${error.type.name}: ${error.issue}\n")
        append("  Fix: ${error.correction}\n\n")
    }
    append("IMPORTANT: The original event has been restored to its pre-update state. ")
    append("The event still exists with its original details.\n\n")
    append("Please retry the operation with the corrected parameters. ")
    append("If you cannot determine the correct parameters with confidence, ")
    append("tell the user what went wrong and ask for clarification rather than guessing.")
}

val correctionMsg = Message.builder()
    .role(ConversationRole.USER)
    .content(ContentBlock.fromText(correctionMessage))
    .build()
conversationHistoryStore.addMessage(familyID, correctionMsg)

reflectionRetryAttempts[familyID] = 1
handleChatRequest("", familyID, thinker, isFollowup = true, turnDepth = turnDepth)
return

This message tells the main agent exactly what was wrong and what to do differently—using the natural-language correction string from the reflection result. If the agent cannot determine the correct parameters confidently, the correction message instructs it to tell the user rather than guess.

9. Second-Pass Validation and Confidence

After a retry, reflection runs a second time—but without rollback. This second pass captures the confidence level of the corrected result and uses it to signal the user:

} else if (currentRetryCount > 0) {
    // Second-pass: validate but don't rollback, capture confidence for user-facing status
    try {
        val secondPassResult = reflectionService.validate(
            userMessage = originalMessage,
            toolCalls = toolCallsForValidation,
            currentDateTime = currentDateTime,
            familyHomeAddress = family?.location?.homeCity,
            familyTimezone = tz
        )
        turnConfidence[familyID] = if (!secondPassResult.valid) ConfidenceLevel.LOW else secondPassResult.confidence
    } catch (e: Exception) {
        turnConfidence[familyID] = ConfidenceLevel.MEDIUM
    }
}

The ConfidenceLevel (HIGH, MEDIUM, LOW) is surfaced to the user as a turn-level signal. LOW confidence means the system is uncertain about the result, giving the user a cue to double-check.

10. Limitations and Caveats

An honest accounting of what reflection does not solve:

One retry only. The retry counter is capped at 1 (reflectionRetryAttempts[familyID] = 1). If the correction also fails, the second-pass captures low confidence but the result stands. No second rollback.

Fail-safe defaults to valid. If the reflection API call fails or the response is unparseable, the service returns ReflectionResult(valid = true):

} catch (e: Exception) {
    println("ERROR: ReflectionService validation failed: ${e.message}")
    // On error, assume valid to avoid blocking user
    return ReflectionResult(valid = true)
}

A Bedrock outage silently bypasses reflection rather than blocking the user.

Snapshot is required for rollback. If the pre-update snapshot is missing (e.g., the tool failed before a snapshot could be taken, or the snapshot was not stored), the system logs a warning but cannot restore the event.

Rich elements required for create rollback. create_calendar_event must return rich elements (event IDs) for reflection to be able to delete the created event. The system forces preferRichElements = true for these tools:

val toolsThatNeedRichElementsForReflection = listOf("create_calendar_event", "bulk_update_calendar_events")
val shouldPreferRichElements = if (call.toolName in toolsThatNeedRichElementsForReflection) {
    println("DEBUG: Forcing preferRichElements=true for ${call.toolName} (needed for reflection)")
    true
} else { ... }

Tool coverage is limited. Only six tools trigger reflection. Other agentic actions are not validated.

The reflection model can still be wrong. The programmatic date facts and post-response filtering mitigate this for day-of-week arithmetic, but they do not cover all reasoning. Other errors in the reflection model's output are not independently verified.

Latency cost. Every turn that touches a calendar or location tool pays for a second LLM call. This is the deliberate trade-off: correctness over speed for high-stakes actions.

11. Conclusion

Reflection adds a structured error-correction loop to the agent's action pipeline. By separating the "do" step from the "check" step, and by giving the second model verified ground truth to work from, the system can catch and correct a class of subtle agentic errors before they reach the user. Rollback keeps this recovery clean: a bad calendar event is deleted or restored before any correction is attempted, so when the rollback and the single retry both succeed, the user never sees the wrong outcome.

The severity model (ERROR vs. WARNING) and the programmatic date facts ensure that reflection does not over-correct—it rolls back only when the outcome is wrong, and it does not contradict deterministic date arithmetic. The correction message gives the main agent clear, actionable feedback for the retry. The result is an agent that can safely take real-world actions, with a second pair of eyes catching the mistakes that would otherwise slip through.