Why This Matters
When an LLM does not do what you want, you have two options: change the input (prompt engineering) or change the model (fine-tuning). Prompt engineering is fast, cheap, and requires no training data. Fine-tuning is slower and more expensive, but it can teach the model behaviors that no prompt can replicate.
Knowing when to prompt and when to fine-tune is a critical engineering decision. Get it wrong, and you either waste months fine-tuning when a good prompt would have worked, or you burn tokens on elaborate prompts when a quick fine-tune would have been more reliable and cheaper at scale.
Visual Model
The full process at a glance.
Two paths: prompting (fast, no training) or fine-tuning (custom behavior, requires data).
Code Example
// Decision framework: prompting vs fine-tuning
function shouldFineTune(scenario) {
  const scores = {
    promptQuality: scenario.promptAccuracy >= 0.9 ? -2 : 2,
    dataAvailable: scenario.labeledExamples >= 100 ? 1 : -2,
    costSensitive: scenario.queriesPerDay >= 10000 ? 2 : -1,
    formatCritical: scenario.needsExactFormat ? 2 : 0,
    domainSpecific: scenario.specializedDomain ? 1 : 0,
  };
  const total = Object.values(scores).reduce((a, b) => a + b, 0);
  return {
    recommendation: total > 0 ? "Fine-tune" : "Keep prompting",
    score: total,
    breakdown: scores
  };
}

// Example scenarios
console.log(shouldFineTune({
  promptAccuracy: 0.95,
  labeledExamples: 50,
  queriesPerDay: 100,
  needsExactFormat: false,
  specializedDomain: false
}));
// -> Keep prompting (prompts already work well)

console.log(shouldFineTune({
  promptAccuracy: 0.7,
  labeledExamples: 5000,
  queriesPerDay: 50000,
  needsExactFormat: true,
  specializedDomain: true
}));
// -> Fine-tune (prompts insufficient, data available)

// LoRA parameter savings
function loraParameters(modelParams, rank) {
  // LoRA freezes the original d x d weight matrix and instead trains
  // two small matrices per layer: A (d x r) and B (r x d)
  const d = Math.sqrt(modelParams); // simplified: assumes a square weight matrix
  const fullParams = d * d;
  const loraParams = 2 * d * rank;
  console.log(`Full fine-tuning: ${fullParams.toLocaleString()} params`);
  console.log(`LoRA (rank ${rank}): ${loraParams.toLocaleString()} params`);
  console.log(`Reduction: ${((1 - loraParams / fullParams) * 100).toFixed(2)}%`);
}

loraParameters(1000000, 8); // 1M-param layer, rank 8

Interactive Experiment
Try these exercises:
- Pick a task you have used an LLM for. Score it on the decision framework above. Does it suggest prompting or fine-tuning?
- Calculate: if your prompt template is 1,500 tokens and you make 10,000 queries/day at $0.03/1K tokens, what is your monthly cost? How much would you save if fine-tuning eliminated the template (reducing to 200 tokens per query)?
- A common LoRA rank is 8. If a layer has 4,096 input and output dimensions, how many parameters does full fine-tuning change vs. LoRA?
- What happens if you fine-tune on bad training data? How would the model behave?
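If you want to check your arithmetic on the cost and LoRA exercises, here is a quick sketch. The 30-day month and the square 4,096 x 4,096 weight matrix are assumptions for illustration:

```javascript
// Monthly token cost, assuming a 30-day month
const tokenCost = (tokensPerQuery, queriesPerDay, costPer1kTokens) =>
  (tokensPerQuery * queriesPerDay * 30 / 1000) * costPer1kTokens;

const promptCost = tokenCost(1500, 10000, 0.03); // ~$13,500/month
const tunedCost = tokenCost(200, 10000, 0.03);   // ~$1,800/month
console.log(`Monthly savings: $${Math.round(promptCost - tunedCost)}`); // $11700

// LoRA vs full fine-tuning for one square 4096 x 4096 layer at rank 8
const d = 4096, rank = 8;
console.log(`Full: ${d * d} params, LoRA: ${2 * d * rank} params`);
// Full: 16777216 params, LoRA: 65536 params
```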
Coding Challenge
Write a function called `compareCosts` that takes: promptTokensPerQuery, fineTunedTokensPerQuery, queriesPerMonth, costPerToken, and trainingCost. Return an object with promptMonthlyCost, fineTunedMonthlyCost, and breakEvenMonths (how many months until fine-tuning becomes cheaper, or -1 if it never does).
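One possible solution sketch, if you want to check your approach (the break-even logic here — ceiling of training cost over monthly savings — is one reasonable interpretation of the spec):

```javascript
// Compare ongoing prompt costs against fine-tuned costs plus training
function compareCosts(promptTokensPerQuery, fineTunedTokensPerQuery,
                      queriesPerMonth, costPerToken, trainingCost) {
  const promptMonthlyCost = promptTokensPerQuery * queriesPerMonth * costPerToken;
  const fineTunedMonthlyCost = fineTunedTokensPerQuery * queriesPerMonth * costPerToken;
  const monthlySavings = promptMonthlyCost - fineTunedMonthlyCost;
  return {
    promptMonthlyCost,
    fineTunedMonthlyCost,
    // Months until cumulative savings cover the training cost,
    // or -1 if fine-tuning never saves money
    breakEvenMonths: monthlySavings > 0
      ? Math.ceil(trainingCost / monthlySavings)
      : -1,
  };
}

// 1,500-token prompt vs 200 tokens fine-tuned, 300k queries/month,
// $0.00003/token ($0.03/1K), $5,000 training cost
console.log(compareCosts(1500, 200, 300000, 0.00003, 5000));
```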
Real-World Usage
The prompting-vs-fine-tuning decision shapes every production LLM system:
- OpenAI fine-tuning API: Companies fine-tune GPT models to match their brand voice, classification schemas, or domain terminology without including examples in every prompt.
- LoRA in open source: Hugging Face hosts thousands of LoRA adapters for Llama and Mistral, each specializing the base model for a different task (SQL generation, medical Q&A, code review).
- RLHF alignment: ChatGPT and Claude are fine-tuned with Reinforcement Learning from Human Feedback to be helpful, harmless, and honest — behaviors that cannot be achieved by prompting alone.
- Domain adaptation: Legal, medical, and financial companies fine-tune models on proprietary data to handle specialized terminology and reasoning patterns.
- Cost optimization: Companies that start with long prompt templates often switch to fine-tuned models as query volume grows, reducing per-query cost by 5-10x.