TL;DR — Abstract the LLM behind a single function call from day one. The wrapper is half a day of work; the cost of not having it is months when the time comes to switch providers. Differences in tool-use, streaming format and system-prompt behaviour bite first.
The LLM your product uses today will not be the LLM it uses in two years. Pricing shifts, capability gaps open and close, vendor risk shows up, and sometimes a model retires. Treating the model as a hard dependency is one of the most expensive design mistakes we see — and one of the easiest to avoid.
This is what we learned standing up Blink AI on multiple frontier models and switching between them without users noticing.
Why teams stay locked in
- Tool-use / function-calling tied to one SDK’s JSON shape
- System prompts tuned to a specific model’s instruction-following style
- Streaming formats hard-wired into the frontend
- Cost monitoring tied to one provider’s billing API
- Evaluation set never re-run on alternatives
The abstraction we ship
One internal function, callLLM(messages, options), hides the provider. Inside it:
- A canonical message shape (role + content + optional tool calls) that any provider can be adapted to
- A tool-call adapter that translates between OpenAI function-call JSON and Anthropic tool-use blocks
- A streaming normalizer that emits a common chunk type regardless of provider
- Per-provider retry, timeout and rate-limit policies
- Cost telemetry recorded against a unified token model
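To make the tool-call adapter from the list above concrete, here is a minimal Python sketch. It assumes payloads have already been parsed into plain dicts. The names `ToolCall`, `from_openai`, `from_anthropic`, `to_openai` and `to_anthropic` are hypothetical and chosen for illustration only. The one fact it relies on is the shape difference noted in this article: OpenAI-style calls carry arguments as a JSON string, while Anthropic-style `tool_use` blocks carry them as a structured object.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """Canonical tool call (hypothetical shape): arguments are always a parsed dict."""
    id: str
    name: str
    arguments: dict[str, Any]


def from_openai(tool_call: dict[str, Any]) -> ToolCall:
    # OpenAI-style entries nest name/arguments under "function",
    # and the arguments arrive as a JSON *string* that must be parsed.
    fn = tool_call["function"]
    return ToolCall(tool_call["id"], fn["name"], json.loads(fn["arguments"]))


def from_anthropic(block: dict[str, Any]) -> ToolCall:
    # Anthropic-style tool_use blocks carry the arguments as a structured "input" object.
    return ToolCall(block["id"], block["name"], dict(block["input"]))


def to_openai(call: ToolCall) -> dict[str, Any]:
    # Re-stringify the arguments when targeting the OpenAI-style shape.
    return {
        "id": call.id,
        "type": "function",
        "function": {"name": call.name, "arguments": json.dumps(call.arguments)},
    }


def to_anthropic(call: ToolCall) -> dict[str, Any]:
    return {"type": "tool_use", "id": call.id, "name": call.name, "input": call.arguments}
```

Keeping the canonical form as a parsed dict means the rest of the application never has to know which provider stringifies arguments and which does not.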
The rest of the application never imports an LLM SDK. Swapping a provider is a config change, not a code change.
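One way to get the "config change, not a code change" property is a small provider registry keyed by a config value. The following Python sketch is an illustration under stated assumptions, not a definitive implementation: `call_llm` mirrors the article's `callLLM`, while `Message`, `PROVIDERS`, `register`, the `llm_provider` config key and the two echo adapters are all hypothetical stand-ins. A real adapter would call the vendor SDK inside its function body.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Message:
    """Canonical message: role + content + optional tool calls."""
    role: str  # "system", "user", "assistant" or "tool"
    content: str
    tool_calls: list[dict[str, Any]] = field(default_factory=list)


# A provider adapter maps canonical messages plus options to a canonical reply.
Provider = Callable[[list[Message], dict[str, Any]], Message]

PROVIDERS: dict[str, Provider] = {}


def register(name: str) -> Callable[[Provider], Provider]:
    """Decorator that adds an adapter to the registry under `name`."""
    def decorator(fn: Provider) -> Provider:
        PROVIDERS[name] = fn
        return fn
    return decorator


@register("echo-a")
def _echo_a(messages: list[Message], options: dict[str, Any]) -> Message:
    # Stand-in adapter: a real one would translate and call a vendor SDK here.
    return Message("assistant", f"[a] {messages[-1].content}")


@register("echo-b")
def _echo_b(messages: list[Message], options: dict[str, Any]) -> Message:
    return Message("assistant", f"[b] {messages[-1].content}")


# Hypothetical config; in practice this would come from a file or environment.
CONFIG = {"llm_provider": "echo-a"}


def call_llm(messages: list[Message], options: Optional[dict[str, Any]] = None) -> Message:
    """Single entry point: callers never see which provider answered."""
    provider = PROVIDERS[CONFIG["llm_provider"]]
    return provider(messages, options or {})
```

Changing `CONFIG["llm_provider"]` from `"echo-a"` to `"echo-b"` reroutes every caller of `call_llm` without touching application code.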
Differences that bite
| Surface | OpenAI / GPT | Anthropic / Claude |
|---|---|---|
| Tool-use schema | JSON Schema, arguments stringified | JSON Schema, structured tool_use blocks |
| System prompt | One message at the top | Dedicated system param, often followed more literally |
| Streaming | SSE chunks with delta | Event stream with content_block_delta |
| Refusal style | Apologetic, often partial | Direct, often with reasoning |
| Long-context behaviour | Strong at recall, weaker at synthesis | Strong at synthesis, careful with sources |
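The streaming row in the table can be illustrated with a small normalizer. This Python sketch assumes the SSE payloads have already been decoded into dicts and handles text only; tool-call deltas and error events are ignored. `TextChunk`, `normalize_openai` and `normalize_anthropic` are hypothetical names, and the event dicts are simplified versions of the two providers' streaming shapes.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass
class TextChunk:
    """Common chunk type emitted to the frontend, regardless of provider."""
    text: str


def normalize_openai(chunks: Iterable[dict[str, Any]]) -> Iterator[TextChunk]:
    # OpenAI-style chunks put incremental text under choices[i].delta.content.
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # skips role-only, empty and final chunks
                yield TextChunk(text)


def normalize_anthropic(events: Iterable[dict[str, Any]]) -> Iterator[TextChunk]:
    # Anthropic-style streams interleave several event types;
    # only content_block_delta events with a text_delta carry text.
    for event in events:
        if event.get("type") != "content_block_delta":
            continue
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield TextChunk(delta["text"])
```

With both streams reduced to `TextChunk`, the frontend renders one format and never branches on the provider.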
What to A/B test before cutting over
- Replay your evaluation set against the new provider. Don’t trust public benchmarks — they don’t reflect your product.
- Shadow-traffic. Send a fraction of real requests to both providers; compare responses offline; do not ship the new one until response-similarity passes your bar.
- Cost re-baseline. Token counts and per-million prices both shift. Plot expected monthly cost at current traffic before deciding.
- Latency budget. p50 and p95 differ. Measure at real prompt sizes, not toy examples.
- Refusal differential. What one model refuses, another may answer. Walk your safety set and decide which behaviour you want.
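The cost re-baseline and latency checks above come down to simple arithmetic. Here is a hedged Python sketch: `monthly_cost` and `p50_p95` are hypothetical helper names, and any prices or token counts you pass in are your own measurements, not figures from this article. Average token counts should be measured per provider, since the same prompt can tokenize differently.

```python
from statistics import quantiles


def monthly_cost(
    requests_per_month: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Projected monthly spend at current traffic, in the currency of the prices."""
    input_cost = requests_per_month * avg_input_tokens / 1_000_000 * input_price_per_million
    output_cost = requests_per_month * avg_output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


def p50_p95(latencies_ms: list[float]) -> tuple[float, float]:
    """Median and 95th-percentile latency from recorded samples (needs >= 2 samples)."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

Run both functions once per provider with latencies captured at real prompt sizes, then compare the two results side by side before deciding.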
How we cut over without user-visible drift
- Wrap the new provider behind the existing callLLM interface
- Run the new provider on 5% of traffic, then 25%, then 100% over two weeks
- Keep the old provider hot as fallback for one full release cycle
- Sample 1–5% of conversations into human review and watch for regressions in coaching quality
- Hold the user-facing voice and system prompts constant — only the model underneath changes
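A staged rollout with a hot fallback can be sketched as follows. This is an illustration under assumptions, not the exact mechanism used in production: `bucket`, `pick_provider` and `call_with_fallback` are hypothetical names, and routing by a hash of the conversation id is one possible choice. Hashing keeps each conversation on the same provider between requests, and anyone routed to the new provider at 5% stays on it at 25%.

```python
import hashlib
from typing import Callable


def bucket(conversation_id: str) -> int:
    """Map an id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_provider(conversation_id: str, rollout_percent: int) -> str:
    """Route to "new" for the first `rollout_percent` buckets, otherwise "old"."""
    return "new" if bucket(conversation_id) < rollout_percent else "old"


def call_with_fallback(
    conversation_id: str,
    prompt: str,
    rollout_percent: int,
    providers: dict[str, Callable[[str], str]],
) -> str:
    """Call the selected provider; if the new one fails, fall back to the old one."""
    choice = pick_provider(conversation_id, rollout_percent)
    try:
        return providers[choice](prompt)
    except Exception:
        if choice == "new":
            # The old provider is kept hot precisely for this case.
            return providers["old"](prompt)
        raise
```

Raising `rollout_percent` from 5 to 25 to 100 is then a config change, and setting it back to 0 is the rollback.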
What changes after the switch
Usually nothing the user notices. Internally, the cost line item changes, the latency distribution shifts slightly, and the safety set may surface different edge cases. Plan a two-week observation window after 100% rollout before you call the switch done.
Frequently asked questions
Should I commit to one LLM provider?
No. Treat the LLM as a swappable component from day one. Designing for one provider and switching later costs far more than building the abstraction up front. Even if the wrapper takes a week rather than half a day, it saves months later.
What breaks first when switching providers?
Tool-use schemas. Each provider serializes function calls differently: OpenAI returns the arguments as a JSON string, while Anthropic returns structured tool_use blocks. System-prompt behaviour is second: Claude follows system prompts more literally than GPT, which affects refusal patterns and persona stability.
