Roadmap
AI Workbench is built in small, shippable phases. Each phase produces a runnable artifact and a stable slice of the HTTP contract.
Status snapshot
| Phase | Scope | Status |
|---|---|---|
| 0 | Runtime bootstrap + docs | ✅ Shipped |
| 1a | Control-plane CRUD (/api/v1/workspaces, /catalogs, /vector-stores) | ✅ Shipped (later refactored — see Phase KB) |
| 1b | Vector-store data plane (provisioning, upsert, search) | ✅ Shipped (later refactored — see Phase KB) |
| 2a | Document metadata CRUD (/catalogs/{c}/documents) | ✅ Shipped (later refactored — see Phase KB) |
| 2b | Ingest + catalog-scoped search + saved queries + cross-replica jobs + adopt + document chunks/delete cascade | ✅ Shipped (saved queries / adopt retired in Phase KB) |
| 2c | Server-side embedding (Astra $vectorize) for search + upsert | ✅ Shipped |
| 3 | Playground + UI | ✅ Shipped |
| Auth | Middleware, API keys, OIDC verifier, browser login, silent refresh | ✅ Shipped (1–3c); 4 (RBAC) planned |
| KB | Catalogs + vector-store descriptors → knowledge bases + chunking/embedding/reranking services | ✅ Shipped |
| Chat-1 | Workspace-level Chat with Bobbie page (UI scaffold) | ✅ Shipped |
| Chat-2 | Persistence — agentic tables wired through memory/file/astra | ✅ Shipped |
| Chat-3 | Chat + message CRUD routes, functional UI | ✅ Shipped |
| Chat-4 | HuggingFace chat completion + multi-KB RAG (sync) | ✅ Shipped (HuggingFace retired in 0.3.0 — see Chat-6) |
| Chat-5 | SSE token streaming end-to-end | ✅ Shipped |
| MCP | Model Context Protocol façade — workspace as MCP tools | ✅ Shipped |
| Agents-1 | Agent + conversation CRUD over the agentic tables | ✅ Shipped |
| Agents-2 | Agent send + streaming pipeline (generalize chat-5 to any agent) + LLM-services CRUD + Bobbie retirement | ✅ Shipped |
| Chat-6 | Retire HuggingFace; unify chat + embeddings on OpenAI-compatibility — OpenRouter (hosted default), direct OpenAI (BYOK), Ollama (offline); live tool-calling model catalog | ✅ Shipped (0.3.0) |
| Agents-3 | Agent calls MCP tools (client-side) — connectMcpClient + remoteMcpTools + dispatcher | ✅ Shipped (0.4.0; 0.5.0 hardening — see "Next steps") |
| 7+ | Native non-OpenAI-compatible providers, polish | Planned (see "Next steps") |
Phase 0 — Bootstrap ✅
Shipped with the initial runtime scaffold.
runtimes/typescript/src/root.ts— Hono-based HTTP entry point.- Config loader that reads
workbench.yaml, interpolates env vars, validates the v1 schema. - Operational endpoints:
GET /,/healthz,/readyz,/version,/docs. - Dockerfile producing a single image on port 8080.
- CI: lint, typecheck, unit tests, Docker build.
Phase 1a — Control-plane CRUD ✅
Shipped across PRs #4, #5, #6, #7, #8.
ControlPlaneStoreinterface with three backends:memory,file,astra.- Full
/api/v1/*CRUD for workspaces, catalogs, and vector-store descriptors. - Astra backend talks to Data API Tables via
@datastax/astra-db-ts— no wrapper libraries in between. SecretResolverwithenv:andfile:providers.- The multi-runtime "green box" model: default TypeScript runtime at
runtimes/typescript/, alternative runtimes as siblings underruntimes/. - Python runtime scaffold (FastAPI) under
runtimes/python/. - Cross-runtime conformance harness with committed fixtures.
Phase 1b — Vector-store data plane ✅
Vectors are now first-class. Descriptors still manage metadata; the actual data lives in a per-workspace backend (in-memory for mock, Data API Collections for astra).
VectorStoreDriverinterface coveringcreateCollection,dropCollection,upsert,deleteRecord,search.- Two drivers registered today:
MockVectorStoreDriver— in-memory, cosine/dot/euclidean math built in, used by CI and workspaces withkind: "mock".AstraVectorStoreDriver— backed by Astra Data API Collections via@datastax/astra-db-ts. Per-workspaceDataAPIClientcache.
POST /api/v1/workspaces/{w}/vector-storesis transactional — descriptor row + collection, with rollback on provisioning failure.DELETEdrops the collection then the descriptor.- New routes:
POST .../records— batch upsert (1..500 per call)DELETE .../records/{id}— single deletePOST .../search— vector + shallow-equal payload filter
- Shared driver contract suite runs against mock and against a fake Astra
Db(faithful enough for cosine-ordering assertions). Real Astra integration — gated behindASTRA_DB_*env vars — ships with Phase 2 when we have actual ingest flows to exercise it. - Capability flags (lexical, rerank, hybrid) and Astra collection creation options (embedding service, source model) remain on the Phase 2+ shortlist.
Phase 2a — Document metadata CRUD ✅
Shipped with the documents HTTP surface.
GET|POST /api/v1/workspaces/{w}/catalogs/{c}/documentsandGET|PATCH|DELETE .../documents/{d}on the canonical TypeScript runtime.- Backed by the already-existing
ControlPlaneStore.*Documentmethods across all three backends (memory, file, astra). - Cross-catalog isolation enforced: a document registered under catalog A is invisible under catalog B in the same workspace (
404 document_not_found). DELETE /catalogs/{c}cascade — already implemented in every backend — is now documented inapi-spec.md.- New conformance scenario
document-crud-basic; fixture committed. - The Python runtime still returns
501 NotImplementedfor documents and will close that gap separately (different owner).
Phase 2b — Ingest, search, queries
Goal: end-to-end knowledge-base flow from raw file to searchable result.
Shipped in this phase so far:
- Embedding seam.
Embedder/EmbedderFactorylanded in Phase 3 for the Playground; reused verbatim by the ingest pipeline — no new contract needed. - Chunking seam.
Chunkercontract atruntimes/typescript/src/ingest/chunker.tsplus a referenceRecursiveCharacterChunkerimpl. Char-based, respects natural text boundaries (\n\n,\n,.,?,!,), overlap-aware, with a shared contract suite (tests/ingest/chunker-contract.ts) that any future chunker must pass. POST .../catalogs/{c}/documents/search— catalog-scoped search that delegates to the catalog's bound vector store. MergescatalogId = catalog.workspaceIdinto the filter so a search cannot escape its catalog. Covered by scenariocatalog-scoped-document-search.- Hybrid + rerank lanes on the search route. Driver contract extended with optional
searchHybridandrerankmethods; mock driver implements both with a cheap tokenizer + min-max normalization. Request body gainshybrid,lexicalWeight, andrerankflags; descriptor-levellexical.enabled/reranking.enabledfeed the defaults. Drivers that lack either method return 501 (hybrid_not_supported/rerank_not_supported). - Astra-native hybrid + rerank.
searchHybridon the Astra driver usesfindAndRerank(astra-db-ts's native hybrid API): vector + lexical + reranker combined in one call. Requires the descriptor to opt into bothlexical.enabledandreranking.enabled;createCollectionpasses those options to the Data API so the collection is provisioned with a lexical index + reranker service. Standalonererankstays unimplemented on Astra (astra-db-ts doesn't expose that primitive); callers sethybrid: trueto get the combined path.lexicalWeightis a no-op on Astra — the reranker owns the blend. POST .../catalogs/{c}/ingest— synchronous end-to-end ingest. Chunks the input text, embeds each chunk (server-side via$vectorizewhen supported, otherwise client-side), upserts into the catalog's bound store, and creates aDocumentrow withstatus: ready. Failures mark the rowfailedwitherrorMessagebefore re-raising. Chunk payloads carrycatalogId,documentId,chunkIndex, plus caller metadata. Covered by scenariocatalog-ingest-basic.POST .../catalogs/{c}/ingest?async=true— same pipeline, returned to the caller as a 202 with ajobpointer. Background worker updates the job record (processed,total,status,errorMessage) through aJobStore.GET .../jobs/{jobId}polls;GET .../jobs/{jobId}/eventsstreams updates via SSE, closing on terminal states. In-flight jobs don't resume across restart (the pipeline's owning worker is gone); durable stores still keep the record around for the operator. Not in conformance (timing-dependent); covered by TypeScript runtime tests.- Durable
JobStorebackends. File (<root>/jobs.json) and Astra (wb_jobs_by_workspace) impls auto-matched tocontrolPlane.driver. Memory stays the default for ephemeral runs. Shared contract suite (tests/jobs/contract.ts) runs the same assertions against each backend. - Cross-replica subscription fan-out. The Astra job store polls subscribed records (default 500ms, tunable via
controlPlane.jobPollIntervalMs) so an SSE client connected to replica B sees updates that landed on replica A. Same-replica updates still fire instantly through the in-process listener registry — the poller only catches the cross-replica case and is a no-op when no one is subscribed. - Lease + heartbeat on running jobs. Workers stamp
leasedByleasedAtwhen they pick a job up and refresh on every progress tick, so a stalled worker is detectable.
- Orphan sweeper. Off by default; clustered deployments opt in via
controlPlane.jobsResume. When on, every replica scans forrunningjobs whose lease is older thangraceMsand CAS-claims them. - Pipeline resume after orphan reclaim. Async-ingest jobs persist an
IngestInputSnapshotalongside the job record (theingest_input_jsoncolumn onwb_jobs_by_workspace). When the sweeper claims an orphan, it replays the original ingest through the sharedrunIngestJobworker — chunk IDs are deterministic (${documentId}:${chunk.index}) so the upsert is idempotent. Wasted embedding cost on the second pass, correct final state. Older jobs without a snapshot, and any future non-ingestkinds, fall back to the original mark-failed path. - Saved queries —
/api/v1/workspaces/{w}/catalogs/{c}/queriesCRUD +POST /{q}/runthat replays through the catalog-scoped search path. Text-only; the/runendpoint merges the catalog's ID into the filter unconditionally so saved queries cannot escape their catalog. New control-plane table (wb_saved_queries_by_catalogon astra); cascades on workspace/catalog delete. Covered by scenariocatalog-saved-queries. - Adopt existing collections —
GET /vector-stores/discoverable+POST /vector-stores/adopt. Operators with collections that already exist in their Astra DB (created by another tool, by hand, or by an older workbench install whose state was wiped) can wrap them in a workbench descriptor without re-provisioning. The driver'slistAdoptable(workspace)reads the live collection's vector / lexical / rerank options off the data plane; the adopt route stamps a descriptor mirroring them. - Document chunks listing + delete cascade —
GET /catalogs/{c}/documents/{d}/chunksreturns the chunks under one document (id, chunkIndex, text, payload). The ingest pipeline stamps a reservedchunkTextpayload key so the text is always retrievable through the new endpoint, regardless of whether$vectorizewas used.DELETE /catalogs/{c}/documents/{d}now cascades into the bound vector store via the new driver methoddeleteRecords(ctx, filter)so chunks no longer orphan when a document is removed.
Phase 2b is closed.
Workspace-scoped API keys moved into their own dedicated auth track — see auth.md for the phased rollout.
Phase 2c — Server-side embedding (Astra vectorize) ✅
Astra Data API collections created under this runtime opt into server-side embedding when the descriptor's embedding names a supported provider (openai, azureOpenAI, cohere, jinaAI, mistral, nvidia, voyageAI). The driver:
- Passes
vector.service: { provider, modelName }atcreateCollection. - Routes
search(text)viafind(sort: { $vectorize: text })insearchByText. - Routes
upsert([{text}])viainsertMany({ $vectorize, ... })inupsertByText. - Attaches the resolved embedding API key as
x-embedding-api-keyper request (header auth, not Astra KMS).
Legacy collections without a service block raise COLLECTION_VECTORIZE_NOT_CONFIGURED; the driver catches and rethrows as NotSupportedError, after which the route layer falls back to client-side embedding via LangChain JS. No migration required on existing data. See docs/playground.md for the dispatch model.
Phase 3 — Playground & UI ✅
Browser UI for exploring workspaces, managing their vector stores, and running searches against them.
Shipped:
/— workspace list + onboarding wizard./workspaces/{workspaceId}— detail, test-connection, vector-store CRUD panel, API-key issue/revoke panel./playground— ad-hoc vector + text queries with expandable results. Seedocs/playground.md.- Playground API: text queries via an extension of the existing
POST .../searchroute (accepts either{ vector }or{ text }— no new endpoint). Upsert followed the same pattern for text records. - UI consumes the existing
/api/v1/*surface — no special admin API. - UI + default TS runtime ship as one Docker image — the image builds
apps/webin a first stage and serves it out of/app/public. Seeruntimes/typescript/Dockerfileanddocs/configuration.md'sruntime.uiDir.
Subsequently shipped under Phase 2b (and surfaced through the workspace UI rather than the playground itself):
- Ingest UI — file upload + paste-text dialog under Workspace → Catalogs, sync and async (SSE-streamed progress).
- Catalog/document browsing — Catalogs panel with per-catalog document list on the workspace detail page.
- Saved queries — catalog-scoped CRUD + run, with a panel under the workspace UI. The playground itself remains a stateless scratchpad by design.
Phase KB — Knowledge bases & execution services ✅
Refactored the catalog / vector-store / saved-query model into a single first-class concept: the knowledge base. A KB owns its Astra collection end-to-end and binds the chunking + embedding + (optional) reranking services that produce its content.
Shipped:
- Knowledge bases. New
wb_config_knowledge_bases_by_workspacetable. KB create transactionally provisions the underlying collection through the workspace's driver, using the KBnameas the owned collection identifier and the bound embedding service to determine vector dimensions and similarity. KB delete drops owned collections and cascades RAG documents; attached KBs detach without dropping external collections. - Execution services. Three new tables —
wb_config_chunking_service_by_workspace,wb_config_embedding_service_by_workspace,wb_config_reranking_service_by_workspace. Multiple KBs may share a service definition; deleting an in-use embedding / chunking service is blocked with409 conflict. - Service immutability for vector-determining bindings. A KB's
embeddingServiceIdandchunkingServiceIdare pinned at create time (the collection's dimensions follow the embedding service);rerankingServiceIdstays mutable. resolveKbsynthesis layer. Existing driver / dispatch / ingest code keeps a vector-store-shaped descriptor view by resolving a KB + its bound services on demand, so the data-plane surface stayed stable across the refactor.- Routes. All catalog / vector-store / saved-query routes retired in favor of:
/api/v1/workspaces/{w}/{chunking,embedding,reranking}-services/api/v1/workspaces/{w}/knowledge-bases[/{kb}].../knowledge-bases/{kb}/{records,search,documents,ingest}
- UI. Catalogs panel + vector-stores panel removed; replaced with
KnowledgeBasesPanelandServicesPanel. Playground now picks a KB rather than a vector-store descriptor.
Saved queries and the adopt-existing-collection flow were retired in this phase — the new shape doesn't need them, and re-adding either would land cleaner under the new model than as a port.
Chat phases ✅ — retired in favor of the agent surface
The /chats route surface (Chat-1 through Chat-5) shipped first as a singleton-Bobbie HTTP layer over the agentic tables. With Agents-2 the chat send + streaming pipeline was generalised to any user-defined agent and the /chats route was deleted; the chat UI now talks directly to the agent endpoints. Existing data on the agentic tables (originally written under the Bobbie row) is untouched and continues to work as ordinary agent records. See agents.md for the current shape.
The historical phase breakdown is preserved below for context — it maps directly onto the agent surface today.
- Chat-1. Workspace-level chat page scaffold with placeholder UI; route + navigation entry from the workspace detail page.
- Chat-2. Persistence layer. Stage-2 agentic tables (
wb_agentic_agents_by_workspace,wb_agentic_conversations_by_agent,wb_agentic_messages_by_conversation) wired through all three control-plane backends. Cascade behavior covers workspace delete, KB delete (kb-id stripped from any conversation'sknowledge_base_idsset), and conversation delete. - Chat-3. CRUD HTTP surface plus a functional ChatPage with sidebar conversation list, composer, and URL-driven conversation selection.
- Chat-4. HuggingFace integration. New
chat/module withChatService,HuggingFaceChatService(over@huggingface/inference'schatCompletion), prompt-assembly, and multi-KB retrieval. Optionalchat:block inworkbench.yaml; without it (and without an agent-bound LLM service), agent send returns503 chat_disabled. - Chat-5. SSE token streaming. The stream emits
user-message, then onetokenevent per delta, then a terminaldone/errorevent carrying the persisted assistant row. The browser uses fetch streaming (notEventSource) since the request isPOSTwith a JSON body. Cancel button wires theAbortSignalthrough to the HF stream so the runtime stops paying for tokens nobody will see.
MCP phase ✅ — Model Context Protocol façade
Workspace-scoped MCP server mounted at /api/v1/workspaces/{w}/mcp. See mcp.md for the walkthrough.
- Streamable HTTP transport (modern MCP). Stateless: each request constructs a fresh server, no session-id tracking.
- Tools (read-mostly):
list_knowledge_bases,list_documents,search_kb(vector / hybrid / rerank),list_chats,list_chat_messages. Pluschat_send(opt-in viamcp.exposeChat: true) which routes through the runtime's global chat service. - Auth. Reuses the shared workspace-route authz wrapper, so a scoped workspace API key for workspace A cannot call MCP tools on workspace B.
- On by default.
mcp.enabled: falseopts out.
Tools deliberately don't include write operations (ingest, KB CRUD, workspace CRUD) yet — see "Next steps" for when to add them.
Followups deferred:
- stdio transport (
npx ai-workbench-mcp) for local Claude / IDE integrations that don't want to round-trip through HTTP. - Per-tool auth scopes so write tools can be enabled per-API-key.
- MCP resources (vs tools) — currently every read is a tool call. Some clients prefer resources for read-only data.
Next steps (not yet started)
The phases below are sequenced loosely; each is independently shippable so reordering doesn't burn earlier work.
User-defined agents (Agents-1 ✅, Agents-2 ✅)
Agents-1 shipped the CRUD surface — agent + conversation primitives, with the chat surface still talking to its own deterministic singleton row. Agents-2 generalised the chat-5 send + streaming pipeline to any agent, wired wb_config_llm_service_by_workspace end-to-end as /api/v1/workspaces/{w}/llm-services (workspace-scoped CRUD), and retired the /chats route + Bobbie singleton entirely. See agents.md for the current shape, including the agent.llmServiceId resolution order.
Remaining open work in this area:
- Multi-provider chat. Per-agent LLM services wire
provider: "openrouter","openai", and"ollama"end-to-end as of 0.3.0 — all three share one OpenAI-compatible adapter, so OpenRouter alone already reaches 300+ models. Other stored providers (e.g. a direct Cohere or Anthropic SDK case) return422 llm_provider_unsupporteduntil the dispatcher grows a case for them; since OpenRouter fronts those vendors, that's now a niche need. - Tool execution via MCP ✅ (shipped 0.4.0). Agents call remote MCP tools through the same SDK on the client side:
connectMcpClient(chat/tools/mcp-client.ts, SSRF-guarded StreamableHTTP), theremoteMcpToolsprovider (chat/tools/providers/remote-mcp.ts, per-server discovery +mcp:{serverId}:{tool}namespacing), and the unified dispatcher (chat/tools/dispatcher.ts) are all wired and tested. 0.5.0 is hardening + access control, not greenfield: save-timetoolIdvalidation, atools:invokeper-call scope gate, a TTL discovery cache, and a server-grouped picker. The persistedwb_config_mcp_tools_by_workspacetable is deliberately deferred — a per-(workspace, server, url, credentialRef)TTL cache (chat/tools/mcp-discovery-cache.ts) kills per-turn connect latency without a new table × 4 backends; revisit a persisted catalog only if cross-replica warm-start becomes a need.
Per-KB / per-agent rate limiting
Chat costs LLM provider tokens; today the runtime relies on the global /api/v1/* IP-based limiter. Per-workspace and per-chat token buckets would let operators bound spend without blocking other endpoints.
Markdown rendering + citation linkbacks ✅
Shipped. The assistant bubble renders sanitized GitHub-flavored markdown via react-markdown + remark-gfm + rehype-sanitize. Inline [chunkId] citations rewrite into deep links that auto-open the cited document's detail dialog in the KB explorer and scroll the matching chunk into view. The runtime persists the chunk → (KB, document) map on each assistant turn at metadata.context_chunks (JSON-encoded compact tuples), so the UI doesn't need a follow-up fetch.
Multi-provider chat backends
ChatService is provider-agnostic, and the runtime ships one OpenAI-compatible implementation that serves OpenRouter, direct OpenAI, and a local Ollama server (HuggingFace was retired in 0.3.0). Adding a native CohereChatService or Anthropic-backed service is mostly mechanical — the prompt assembler and route are unchanged — but OpenRouter already fronts those vendors, so it's low priority.
Production-grade chat persistence
The agentic tables are write-heavy under streaming load (one row per assistant turn, plus the user turn before it). Astra handles that fine, but the file backend writes the whole messages.json on every append. Worth either:
- A per-chat append-only log file, or
- A SQLite-backed
filedriver variant for chat-heavy deployments.
Conformance for the chat surface
Every other route surface has a cross-runtime conformance fixture. Chat doesn't yet — partly because the streaming SSE shape is trickier to fixture, partly because there's no Python runtime implementation yet. Add a fixture set covering the CRUD surface (easy) and the SSE happy path with a deterministic fake provider (less easy but worth it before a second runtime).
Cross-cutting workstreams
These run continuously rather than as discrete phases:
- Observability. Structured logs with
workspaceId, request IDs, and OpenTelemetry traces. Both shipped — pino logs out of the box, manual server spans through@opentelemetry/api(no-op without an SDK), and a NodeSDK + auto-instrumentations bundle behindruntime.tracing.enabled: true. Prometheus exposition at/metrics. Seeproduction.md. - Conformance. Every route added lands with a scenario and regenerated fixtures. Every language runtime updates in the same PR. Enforced by the drift-guard test.
- Docs. Every route addition updates
api-spec.mdin the same PR. The generated OpenAPI at/api/v1/openapi.jsonis always in sync with the running runtime. - Polyglot runtimes. Each language green box that gets taken out of scaffold status adds a row to the "current runtimes" table in
green-boxes.md.
Open questions
Things we have deliberately not decided and should revisit before the corresponding phase:
- Multi-tenant auth model. Is a workspace the tenant, or is there a tenant-above-workspace concept for SaaS deployments?
- Secrets backends.
envandfileproviders are fine for single-node self-hosted. Hosted deployments likely want pluggable providers (Vault, AWS Secrets Manager, etc.).SecretProvideralready supports this. - Chunker/embedder plugin model. In-process only, external HTTP contract, or both?
- Hot reload. Worth the complexity, or is restart-on-change sufficient? (Leaning restart-only — the blast radius of config changes is small now that workspaces are runtime data.)
- Schema version 2. What changes are we queueing that would force a bump, and how do we stage it?