Engineering · 2026-10-28

Why Camiel does not have a chat interface

Camiel — our privacy-focused answering engine — is built around a search box and a permalinked answer page, not a chat. The argument for that choice is technical, not aesthetic, and it falls out of what verifiability actually requires.

Camiel is a privacy-focused answering engine. The user types a question; the system retrieves passages, synthesises an answer, displays the answer with the cited passages alongside. There is a search box and a result page. There is no chat box. There is no follow-up question button. There is no continuous conversational state across queries.

We are asked, often, to add a chat interface. The asks come from users (chat is what they expect from an AI product), from product collaborators (chat is what the category looks like), and occasionally from investors (chat is how the category is framed). We have not added one, and we do not plan to. The reasoning is technical, not aesthetic.

A verified-answer system needs every query to be reproducible from the query alone. A chat interface implies that the answer to the third question depends on the first two. If a user asks "who funds them?", the them must be in the question or the question is unanswerable. In a chat, them is in the model's context, and the answer is correct if the model's reference resolution happens to be correct. This is a perfectly normal property of a conversational system. It is the wrong property for a system whose value proposition is the answer is verifiable, the citations are real, the claim is supported by the source. Verifiability requires that the input to which the answer is bound is the input the user can read. Hidden context, for an answering engine, is the wrong shape.

Citations don't compose well across turns. A single-question answer with citations is a stable artifact: the citation set is what supports the claim, and the user can check each one. A multi-turn answer with citations is unstable in a specific way: by turn three, the model has made claims that combine information from cited passages in turn one, citations from turn two, and its own internal synthesis. Tracking which citation supports which claim across turns becomes either a UI engineering problem the team will not actually solve well (most chat interfaces don't), or a verbose experience that displays every citation from every prior turn next to every new claim (which makes the interface unreadable). Both directions degrade the per-answer fidelity that is the product's reason to exist.

Chat invites lazy queries; the search box rewards careful ones. This is the part of the argument we are least sure of in the abstract, and the part where direct user testing keeps confirming the suspicion. A search box prompts the user to phrase the question in full; a chat box prompts the user to type something terse and let the system fill in the rest. For a system whose retrieval quality depends on a precise query, the affordance matters more than it might in a system whose value comes from reasoning over context the model already has. The interface should match the workload, not the genre.

The most useful evaluation unit is the per-query answer. Internally, when we evaluate Camiel against benchmarks (or against ourselves over time), we evaluate on a set of (query, answer, citations) triples. Each triple is independently scorable: did the citations support the claims, were the citations real, did the answer hedge appropriately on uncertainty. Conversation muddies this. A conversation-shaped evaluation is harder to construct, harder to score, harder to compare across systems, and harder to use as a regression-detection tool when the underlying retrieval or synthesis layer changes. The decision to keep the unit-of-evaluation small was upstream of the decision to keep the unit-of-interaction small.

What we have instead of chat:

A permalink per query. Every (query, answer, citations) triple has its own URL. Users share the URL, return to it, link to it from other places. The query is in the URL; the answer is reproducible from the query.
An explicit new query path. Users can refine — but the refinement is a new query that they type in full, with the new context made explicit. We found, in user testing, that prompting users to type the question you actually have, in full produced better questions than prompting them to ask a follow-up.
No memory across queries by default. A signed-in user's queries are logged for that user's own retrieval, not as context for the model. The model sees the current query and only the current query.

What we tried that did not work:

A "refine your query" affordance that took the previous query and the new input as joint input to a query-rewriting step. Users used it as a chat interface. The rewriting step was a quiet implicit chat. We removed it and the surface improved.
Suggested follow-up questions below the answer. They produced engagement, which we initially read as a positive signal. Closer inspection suggested the suggestions were directing users toward queries the system answered well, rather than toward questions the users actually had — which is the wrong gradient to lean on.

The condition under which we would reconsider: if the dominant entry point to the engine becomes voice. Voice interfaces have very different ergonomics — typing a precise query is hard with the voice, where it is natural with the keyboard — and the conventions users bring to a voice interface are conversational. If voice traffic eventually dominates, the per-query model will need a voice-shaped variant, and that variant may look more like chat than the current one does. We are watching the share of voice traffic. It has not yet crossed the threshold where the keyboard-first design is the wrong default.

A short note on what this is not an argument against. We are not arguing that chat is the wrong interface for AI products in general. It is the right interface for many things — open-ended assistance, code editing, drafting, agentic workflows. Camiel is none of those. It is an answering engine whose value is the per-question fidelity of its answers. The interface should match the product, not the category.

← All engineering posts