Field notes · FN-014 · Why we stopped fine-tuning, and started writing better prompts in fewer words
Drawing G2-FN-014 · Field journal · Engineering

By Andrew Maryasov · Grow2.ai · 2026-04-09 · 12 min read

Grow2.ai spent six weeks fine-tuning a 70B open-weights model on Atlas Rentals messaging history. The result: a slower agent that scored 4.2/5 on our internal evaluation, cost €4,180 to train and €890/month to host. On Day 43 we replaced it with Claude Sonnet plus a 1,400-word system prompt; on the same evaluation harness, the prompt-only version scored 4.7/5 and cost €68/month.

The setup we tried first

The pitch we sold ourselves was straightforward: Atlas had 14 weeks of historical guest messages across five languages, the tone was consistent, and the per-unit knowledge wiki was structured. Fine-tuning would beat any prompt-only approach because the model would absorb voice and domain together.

We picked Llama-3.1-70B as the base model, prepared 11,200 conversation pairs with reviewer-validated outputs, and ran SFT for ~38 hours on rented A100s. Total fine-tune cost: €4,180. We hosted on a serverless GPU endpoint at €0.0008 per 1k input tokens, which worked out to ~€890/month at Atlas volumes.
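To make the ledger concrete, here is the first-year arithmetic behind the two serving options as a minimal Python sketch. The figures come from this note; the function name and the 12-month horizon are our framing, not a Grow2.ai accounting convention.

```python
# First-year cost of each serving option, using the figures from this note.
FINE_TUNE_ONE_OFF = 4180   # € — SFT run on rented A100s
FINE_TUNE_MONTHLY = 890    # €/month — serverless GPU endpoint
API_MONTHLY = 68           # €/month — Claude Sonnet at Atlas volumes

def first_year_cost(one_off: float, monthly: float, months: int = 12) -> float:
    """Total cost over the first `months` of operation."""
    return one_off + monthly * months

print(first_year_cost(FINE_TUNE_ONE_OFF, FINE_TUNE_MONTHLY))  # 14860
print(first_year_cost(0, API_MONTHLY))                        # 816
```

Roughly an 18x gap in year one, before counting the evaluation-score difference.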

The evaluation that surprised us

We ran the same 280-conversation eval suite against three configurations: fine-tuned 70B, Claude Sonnet with a short prompt, Claude Sonnet with a structured 1,400-word system prompt. Reviewers were blind to which output came from which model.

  • Fine-tuned 70B: 4.2/5 quality, 1.8s median latency, occasional drift on edge cases not covered in training data.
  • Claude Sonnet, short prompt: 4.4/5 quality, 0.9s latency, generic tone — "sounded like a chain hotel".
  • Claude Sonnet, full prompt: 4.7/5 quality, 1.0s latency, consistently in Atlas voice on edge cases.
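The blinding step of a harness like the one above can be sketched in a few lines. This is illustrative, not our production code: the real suite ran 280 conversations through the three configurations, and the names here (`blind_outputs`, config labels) are our own.

```python
import random

def blind_outputs(conversation_id, outputs):
    """Shuffle (config_name, output) pairs; return anonymised outputs
    plus the key needed to unblind reviewer scores afterwards."""
    shuffled = outputs[:]
    random.shuffle(shuffled)  # hide which config produced which output
    key = {f"{conversation_id}/output_{i}": name
           for i, (name, _) in enumerate(shuffled)}
    anonymised = [(f"{conversation_id}/output_{i}", text)
                  for i, (_, text) in enumerate(shuffled)]
    return anonymised, key   # reviewers only ever see `anonymised`

def unblind(scores, key):
    """Map reviewer scores {anon_id: 1..5} back to config names."""
    return {key[anon_id]: s for anon_id, s in scores.items()}
```

Reviewers rate the anonymised outputs; the key stays with the harness until scoring is done.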

The fine-tuned model was not just expensive — it was worse. The 1,400-word prompt encoded the same tone and the same per-unit knowledge with explicit, debuggable rules. When a guest reported a broken kettle in unit 47, we could read the prompt and predict the response. With the fine-tune, we could only run it and hope.
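The property that made the prompt predictable is worth showing in miniature: per-unit knowledge encoded as one explicit rule per line. This is a hypothetical sketch; the wiki format, section headers, and example rule are ours, not Atlas data, and the real prompt runs ~1,400 words.

```python
# Illustrative shape of a debuggable structured system prompt:
# voice rules up top, then one greppable rule per unit and topic.
TONE = "Warm and concise. Use the guest's first name. Never sound like a chain hotel."

UNIT_WIKI = {  # hypothetical entries, not real Atlas data
    "unit_47": {
        "kettle": "Spares are in ground-floor storage; offer a swap within 2 hours.",
    },
}

def build_system_prompt(units: dict) -> str:
    lines = ["# Voice", TONE, "", "# Per-unit knowledge"]
    for unit, facts in sorted(units.items()):
        for topic, rule in sorted(facts.items()):
            lines.append(f"- {unit} / {topic}: {rule}")
    return "\n".join(lines)
```

With this shape, "broken kettle in unit 47" maps to a single line you can read, diff, and point at when predicting the agent's reply.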

The framework we use now

Grow2.ai now decides between fine-tuning and prompt engineering by asking three questions. If the answer is "no" to all three, we use a frontier model with a structured prompt:

  1. Is the task domain so narrow that frontier models genuinely fail on it? Code generation in a proprietary DSL, medical terminology that frontier models hedge on. Not customer support.
  2. Will the agent run at >100k requests/day per client? At that volume, per-token economics start to favor self-hosting. Below it, the fixed hosting overhead means the API wins.
  3. Does the client require on-premise inference for compliance? Healthcare, legal, government — sometimes yes. Hospitality, retail, dental — never.

For SMB pilots — Grow2.ai's entire book of business — the answer is consistently no on all three. We default to Claude Sonnet, write the prompt by hand with the tone editor and the client's front-desk lead, and ship.
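The three questions above reduce to a short checklist. The threshold mirrors the text; the type and field names are our sketch, not a Grow2.ai API.

```python
from dataclasses import dataclass

@dataclass
class EngagementProfile:
    frontier_fails_on_domain: bool   # Q1: proprietary DSL, hedgy medical terminology
    requests_per_day: int            # Q2: per-client volume
    requires_on_prem: bool           # Q3: compliance-driven self-hosting

def should_fine_tune(p: EngagementProfile) -> bool:
    """'Yes' to any question justifies a fine-tune; otherwise
    default to a frontier model with a structured prompt."""
    return (p.frontier_fails_on_domain
            or p.requests_per_day > 100_000
            or p.requires_on_prem)

# A typical SMB pilot answers "no" on all three:
print(should_fine_tune(EngagementProfile(False, 3_000, False)))  # False
```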

What we tell clients now

When a prospective client asks for fine-tuning specifically, we walk them through the cost ledger above. We have lost two pilots over this — both clients had been pre-sold on fine-tuning by previous vendors and were not interested in revisiting the assumption. That is fine. Grow2.ai charges for outcomes; if the outcome path runs through a 70B fine-tune, somebody else is the right vendor.

Have a problem this note describes? Bring it to a call.

Field notes are written for the version of Grow2.ai that will run into the same problem in eight months. If one of them describes your situation, that's usually a good sign we should talk.
