Pick the right model
Honest advice — premium isn't always better. The right answer is often 'the cheap one, but check the results.'
The two tiers
| Tier | Models | Best for | Cost relative |
|---|---|---|---|
| Standard | Mistral Small · GPT-4o mini · Claude Haiku · Gemini Flash | Bulk Q&A. Questionnaire row-by-row. Handbook chat. Short summaries. | 1× |
| Premium | Mistral Large · GPT-4o · Claude Sonnet · Gemini Pro | High-stakes drafting. Long-context reasoning. Strict instruction-following. Anything customer-facing. | 5-15× |
Try the cheap one first. For most RAG tasks (the answer is in the corpus, the model just needs to find and rephrase it), standard-tier models perform indistinguishably from premium. Always start cheap, measure quality on 20-30 real questions, only upgrade if there's a measurable gap.
When each tier is the right choice
Use a standard-tier model when
- Volume matters. You're filling 200 rows of a security questionnaire and cost adds up linearly. A 10× cheaper model that's "almost as good" is the right call.
- The corpus does the heavy lifting. If the answer is verbatim in a single chunk and the model just needs to extract and lightly rephrase, even Mistral Small is perfectly capable.
- You're iterating. While you're tuning the system prompt and retrieving the right chunks, every test costs money. Stay cheap until the structure is right; consider upgrading the model at the end if you need to.
- Latency matters. Standard models respond in 1-3 seconds; premium can be 5-15 seconds. For interactive chat surfaces this is felt.
Use a premium model when
- The output is customer-facing. An RFP response that goes back to a real prospect deserves the better model — the marginal cost is trivial against the deal value.
- The task requires synthesising across multiple chunks. "Compare our SOC 2 and ISO controls and tell me where they overlap" needs reasoning that cheap models can fumble.
- Strict JSON schema compliance matters. Premium models are more likely to return well-formed JSON on the first try, which matters when an Excel column needs to be populated and a parse failure breaks the row.
- Long context is involved. If you've cranked
top_kto 15+ chunks, premium models handle the long-context reasoning more reliably.
A simple A/B procedure
Pick a model based on intuition is fine. Pick a model based on 20 real questions is better:
- Make two copies of your agent — one with each model. Same system prompt, same scope, same top_k, same temperature.
- Run the same 20-30 questions through both. Use realistic questions, not softballs.
- Score each answer 1-5 on the two dimensions you care about — usually accuracy (does it match the source?) and style (is it usable as-is, or does it need editing?).
- If premium isn't materially better, stay with standard. Most of the time it isn't.
Provider differences (rough characterisations)
All four providers are good. They tend to be different in flavour:
- Mistral — strong at French, German, Italian, Spanish. Mistral Small is one of the best value-for-money standard-tier models.
- OpenAI (GPT-4o family) — excellent instruction-following. Reliable structured output. The safe default in many enterprise environments.
- Anthropic (Claude) — best-in-class grounding fidelity. Strongly tends to say "I don't see that in the sources" instead of hallucinating. Good first pick for security and legal agents.
- Google (Gemini) — strong reasoning over long contexts. Gemini Flash is very fast. Pricing competitive on volume.
Changing the model on an existing agent
Open the agent → Edit → Model dropdown → save. Existing keys keep working — the very next call uses the new model. Useful for A/B tests: clone the agent, change only the model, run both for a week, kill the loser.
Cost & quota visibility
Every generation (browser, Word, Excel, API) counts against your monthly cap. See Settings → Plan & billing for current usage. Each agent's call log shows the token count and provider, so you can see which agents are eating the budget.
If you're approaching the cap, the first place to look is high-temperature long-output agents — drop max_tokens or temperature before upgrading plan.