Elevator pitch
What are you proposing to change?For v2 of the protocol wire format, I am proposing a change in the lifecycle of the prompt request, allowing for more dynamic session updates from the agent, and unlocking new capabilities in the process. Once a session is created, the agent will be able to send session updates at any point in time, and prompt requests will last until the prompt is accepted, not until the end of the turn. As I’ll go into later, this not only removes some current awkwardness around the prompt request lifecycle, but also provides a more flexible foundation to add features like queued messages and multi-client replay. This can even allow the agent to initiate an interaction in a session rather than requiring it to wait for a user prompt, which is becoming increasingly important for background tasks and agents which may send updates before or after a “turn” is over since its runtime might be different than the main conversation.
Status quo
How do things work today and what problems does this cause? Why would we change things?Currently, the protocol kind of assumes that all turns will be initiated by a client and ended by an agent, with a series of session update notifications in-between. While in many cases this is enough, it is becoming clear that this model is not flexible enough. It is not clear how to model queued messages for instance: would these create a new turn request lifecycle? Or fit into the existing one? What if the agent wants to submit some text at the start of a session before the user prompts? Or a status update? Also, if an agent finishes it’s turn, wants to wait for the next user action, but had a background subagent or task running, can it only submit updates about that status after the user prompts again? When replaying a session, the prompt request can be turned into a user message notification, but what about the end of turn response? If you call load during a currently running session, how do you know that the turn is done? Some clients handle these out-of-turn updates more gracefully than others. But it is a constant point of confusion in discussions and issues. In the spirit of allowing as much flexibility in the protocol for new paradigms and designs to emerge in the prompt lifecycle, I think imposing fewer restrictions in the protocol, whether explicitly described or just implicitly inferred because of vague wording, on when participants can make session updates will allow for more dynamic sessions, as well as make it easier to extend to new use cases in the future.
What we propose to do about it
What are you proposing to improve the situation?
Change the session/prompt response
session/prompt is still a request, but its response lifecycle will change.
The agent will respond once the prompt has been accepted, not when the turn is over. Message IDs do not need to be returned from the prompt response; per the message id RFD, the agent remains the source of truth for message IDs and provides them through the session update stream when it emits the accepted user message.
Additional Agent session/update notification types
Because session/updates can more freely flow from the agent, and we lost the ability to pass end_turn and other information from the prompt response, we need to provide the agent with the affordance for a few more notification types.
User message accepted/acknowledged
In order to have a consistent understanding between agent and client on where the user message appears within the session history in relation to other messages, it is important to see when and where the agent has accepted the user message into the feed. This will also be important for queueing messages, depending on how we implement that, so that the client can know if it is still allowed to edit the queued message, or where in the turn order it got inserted. Even without a new queue, which may allow for editing the queued message, it means that the client doesn’t necessarily have to send asession/cancel before prompting. This would need some exploration, but potentially the agent could decide whether it cancels the current turn and inserts it immediately, or inserts it at the next convenient break point. This should probably still be defined as “as soon as possible” and queueing would enable some later points, but it could still be more graceful than needing to cancel all current tool calls for example, as is required at the moment.
The question then turns to what makes up this notification. Which brings us to:
Who owns the user message id?
The message id RFD proposes that the agent owns message IDs. This means the client cannot eagerly create a protocol message ID for a prompt, but it also means we avoid a shared uniqueness or UUID requirement between clients and agents.
By allowing the agent to replay or acknowledge the user message as a session update, we have a natural place for the agent to provide both the content and the ID once it is inserted in the session. Ultimately the agent is responsible for session persistence. If there is only one source for IDs, we can continue to treat them as opaque strings that fit well into each agent’s implementation.
My current proposal is that this would look like the client sending the following message:
messageId in that notification gives the client the ID for future message-specific operations.
This is a new message type as well. Not a user_message_chunk but just a user_message that allows for sending the entire message at once. For v2 we will make sure to allow for the agent to do the same on their messages as well, providing both full and partial streaming update patterns for a given message.
state_change notification
This would be a notification from the agent to indicate that it’s current status has changed, such as the “turn” has ended, carrying information like stopReason and usage data for that turn.
Running, to indicate that a turn has begun. Important now that turns aren’t tied necessarily to prompts:
Shiny future
How will things will play out once this feature exists?This isn’t a huge schema change, but it is a fundamental behavior change in the protocol that I believe:
- Provides agents with much more flexibility in how they want to update a client about a given session
- Solves some concrete pain points we have the the current model (i.e. how to integrate prompts into session/load and multi-client replays, message ids, etc)
Implementation details and plan
Tell me more about your implementation. What is your detailed implementation plan?Overall, this isn’t a huge lift on schema definition, but it is a large, breaking change in behavior which means we can only stabilize in protocol version 2. Depending on how the rest of v2 testing goes, we can either:
- Make this an opt-in “future-flag” capability on v1 so people can experiment, but it would be an unstable feature regardless.
- We establish a preview/beta flow for v2
_meta flag.
Frequently asked questions
What questions have arisen over the course of authoring this document or during subsequent discussions?I’ve hopefully addressed all of the questions and concerns for what motivated this above, but happy to engage with others on this.
What alternative approaches did you consider, and why did you settle on this one?
Prompt as a notification
Early discussions revolved around having this be a bidirectional stream of notifications on the session. While this felt very symmetrical and appealing, it ran into several problems in practice:- Clients only really had one type of notification that made sense to emit on the session: user messages
- The Agent would still need to replay that message to show where it got accepted within the message history
- We would then need a notification-based way of emitting errors for invalid prompts that would need to be tied to fire-and-forget notifications.