Field notes · FN-014 · Why we stopped fine-tuning, and started writing better prompts in fewer words
Drawing G2-FN-014 · Field journal · Engineering

By Andrew Maryasov · Grow2.ai · 2026-04-09 · 12 min read

Grow2.ai spent six weeks fine-tuning a 70B open-weights model on Atlas Rentals messaging history. The result: a slower agent that scored 4.2/5 on our internal evaluation, cost €4,180 to train and €890/month to host. On Day 43 we replaced it with Claude Sonnet plus a 1,400-word system prompt; on the same evaluation harness, the prompt-only version scored 4.7/5 and cost €68/month.

The setup we tried first

The pitch we sold ourselves was straightforward: Atlas had 14 weeks of historical guest messages across five languages, the tone was consistent, and the per-unit knowledge wiki was structured. Fine-tuning would beat any prompt-only approach because the model would absorb voice and domain together.

We picked Llama-3.1-70B as the base model, prepared 11,200 conversation pairs with reviewer-validated outputs, and ran SFT for ~38 hours on rented A100s. Total fine-tune cost: €4,180. We hosted on a serverless GPU endpoint at €0.0008 per 1k input tokens, which worked out to ~€890/month at Atlas volumes.
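To make the ledger concrete, here is the first-year arithmetic behind the two serving options as a minimal Python sketch. The figures come from this note; the function name and the 12-month horizon are our framing, not a Grow2.ai accounting convention.

```python
# First-year cost of each serving option, using the figures from this note.
FINE_TUNE_ONE_OFF = 4180   # € — SFT run on rented A100s
FINE_TUNE_MONTHLY = 890    # €/month — serverless GPU endpoint
API_MONTHLY = 68           # €/month — Claude Sonnet at Atlas volumes

def first_year_cost(one_off: float, monthly: float, months: int = 12) -> float:
    """Total cost over the first `months` of operation."""
    return one_off + monthly * months

print(first_year_cost(FINE_TUNE_ONE_OFF, FINE_TUNE_MONTHLY))  # 14860
print(first_year_cost(0, API_MONTHLY))                        # 816
```

Roughly an 18x gap in year one, before counting the evaluation-score difference.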

The evaluation that surprised us

We ran the same 280-conversation eval suite against three configurations: fine-tuned 70B, Claude Sonnet with a short prompt, Claude Sonnet with a structured 1,400-word system prompt. Reviewers were blind to which output came from which model.

  • Fine-tuned 70B: 4.2/5 quality, 1.8s median latency, occasional drift on edge cases not covered in training data.
  • Claude Sonnet, short prompt: 4.4/5 quality, 0.9s latency, generic tone — "sounded like a chain hotel".
  • Claude Sonnet, full prompt: 4.7/5 quality, 1.0s latency, consistently in Atlas voice on edge cases.
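The blinding step of a harness like the one above can be sketched in a few lines. This is illustrative, not our production code: the real suite ran 280 conversations through the three configurations, and the names here (`blind_outputs`, config labels) are our own.

```python
import random

def blind_outputs(conversation_id, outputs):
    """Shuffle (config_name, output) pairs; return anonymised outputs
    plus the key needed to unblind reviewer scores afterwards."""
    shuffled = outputs[:]
    random.shuffle(shuffled)  # hide which config produced which output
    key = {f"{conversation_id}/output_{i}": name
           for i, (name, _) in enumerate(shuffled)}
    anonymised = [(f"{conversation_id}/output_{i}", text)
                  for i, (_, text) in enumerate(shuffled)]
    return anonymised, key   # reviewers only ever see `anonymised`

def unblind(scores, key):
    """Map reviewer scores {anon_id: 1..5} back to config names."""
    return {key[anon_id]: s for anon_id, s in scores.items()}
```

Reviewers rate the anonymised outputs; the key stays with the harness until scoring is done.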

The fine-tuned model was not just expensive — it was worse. The 1,400-word prompt encoded the same tone and the same per-unit knowledge with explicit, debuggable rules. When a guest reported a broken kettle in unit 47, we could read the prompt and predict the response. With the fine-tune, we could only run it and hope.
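The property that made the prompt predictable is worth showing in miniature: per-unit knowledge encoded as one explicit rule per line. This is a hypothetical sketch; the wiki format, section headers, and example rule are ours, not Atlas data, and the real prompt runs ~1,400 words.

```python
# Illustrative shape of a debuggable structured system prompt:
# voice rules up top, then one greppable rule per unit and topic.
TONE = "Warm and concise. Use the guest's first name. Never sound like a chain hotel."

UNIT_WIKI = {  # hypothetical entries, not real Atlas data
    "unit_47": {
        "kettle": "Spares are in ground-floor storage; offer a swap within 2 hours.",
    },
}

def build_system_prompt(units: dict) -> str:
    lines = ["# Voice", TONE, "", "# Per-unit knowledge"]
    for unit, facts in sorted(units.items()):
        for topic, rule in sorted(facts.items()):
            lines.append(f"- {unit} / {topic}: {rule}")
    return "\n".join(lines)
```

With this shape, "broken kettle in unit 47" maps to a single line you can read, diff, and point at when predicting the agent's reply.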

The framework we use now

Grow2.ai now decides between fine-tuning and prompt engineering by asking three questions. If the answer is "no" to all three, we use a frontier model with a structured prompt:

  1. Is the task domain so narrow that frontier models genuinely fail on it? Code generation in a proprietary DSL, medical terminology that frontier models hedge on. Not customer support.
  2. Will the agent run at >100k requests/day per client? At that volume, per-token economics start to favor self-hosting. Below it, the fixed hosting overhead means the API wins.
  3. Does the client require on-premise inference for compliance? Healthcare, legal, government — sometimes yes. Hospitality, retail, dental — never.

For SMB pilots — Grow2.ai's entire book of business — the answer is consistently no on all three. We default to Claude Sonnet, write the prompt by hand with the tone editor and the client's front-desk lead, and ship.
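The three questions above reduce to a short checklist. The threshold mirrors the text; the type and field names are our sketch, not a Grow2.ai API.

```python
from dataclasses import dataclass

@dataclass
class EngagementProfile:
    frontier_fails_on_domain: bool   # Q1: proprietary DSL, hedgy medical terminology
    requests_per_day: int            # Q2: per-client volume
    requires_on_prem: bool           # Q3: compliance-driven self-hosting

def should_fine_tune(p: EngagementProfile) -> bool:
    """'Yes' to any question justifies a fine-tune; otherwise
    default to a frontier model with a structured prompt."""
    return (p.frontier_fails_on_domain
            or p.requests_per_day > 100_000
            or p.requires_on_prem)

# A typical SMB pilot answers "no" on all three:
print(should_fine_tune(EngagementProfile(False, 3_000, False)))  # False
```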

What we tell clients now

When a prospective client asks for fine-tuning specifically, we walk them through the cost ledger above. We have lost two pilots over this — both clients had been pre-sold on fine-tuning by previous vendors and were not interested in revisiting the assumption. That is fine. Grow2.ai charges for outcomes; if the outcome path runs through a 70B fine-tune, somebody else is the right vendor.

Have a problem this note describes? Bring it to a call.

Field notes are written for the version of Grow2.ai that will run into the same problem in eight months. If one of them describes your situation, that's usually a good sign we should talk.
