LLM Fine-Tuning in Practice: Choosing Between Full Fine-Tuning and LoRA

Aaron

Introduction

I recently needed to fine-tune a large language model for a project and spent some time comparing full fine-tuning against LoRA, so I decided to organize what I learned along the way. The idea that LLMs work out of the box sounds wonderful, but in practice you quickly find that a pre-trained model knows how to continue text, not how to follow instructions. If you want a model that actually does what you ask, supervised fine-tuning (SFT) is hard to avoid. This post covers the core principles, how to choose an approach, and some practical lessons, in the hope that it helps anyone else working on model deployment.

Why Pre-Trained Models Aren’t Enough

Let’s start with a basic fact: a pre-trained model only knows how to continue text, not hold a conversation.

This isn’t a bug; it’s by design. The pre-training objective is to learn “given the preceding context, predict the next token.” You type “the weather is beautiful today” and it continues with “perfect for a walk.” Great at continuation, terrible at following instructions [1].

Think of it this way: pre-training is like having someone read thousands of books. They’re incredibly knowledgeable, but nobody has ever taught them how to interact with people. Ask them “can you summarize this for me?” and they might respond “sure, show me what you’ve got!” without actually doing any summarizing. They don’t understand that “summarize” is a command to execute, not small talk.

SFT (Supervised Fine-Tuning) uses labeled instruction data to train the model to understand and follow human instructions. Without this step, the model is just a fancy text completer. With it, the model becomes a genuine assistant.
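To make "labeled instruction data" concrete, here is a minimal sketch of how a single SFT training example might be assembled. The prompt template, the whitespace "tokenizer", and the helper name `build_sft_example` are all assumptions for illustration; a real pipeline would use the model's own tokenizer and chat template. The `-100` label follows the default ignore index of PyTorch's cross-entropy loss, so the loss is computed only on the response tokens.

```python
IGNORE_INDEX = -100  # assumed "skip this position" label (PyTorch's default ignore index)


def build_sft_example(instruction, response, vocab):
    """Turn one (instruction, response) pair into input ids and loss labels."""
    # Hypothetical prompt template; real projects use the model's own format.
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"

    def encode(text):
        # Toy tokenizer: each new whitespace-separated word gets the next free id.
        return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

    prompt_ids = encode(prompt)
    response_ids = encode(response)

    input_ids = prompt_ids + response_ids
    # Mask the prompt so the model is only trained to produce the response.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


vocab = {}
input_ids, labels = build_sft_example(
    "Summarize: the cat sat on the mat.",
    "A cat sat on a mat.",
    vocab,
)
print(len(input_ids), labels)
```

The point of the mask is that the model still sees the instruction as context, but gradient signal only comes from the answer it is supposed to give.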

SFT also has a nice bonus: generalization. Train the model on summarization, sentiment analysis, and translation, and it often gets better at tasks it was never explicitly trained on, such as reasoning and rewriting. This “learning by analogy” ability is what makes large models so fascinating.

Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning

Fine-tuning approaches fall into two main camps:

| Category | Description | Best For |
| --- | --- | --- |
| Full Fine-Tuning | Updates all model parameters | High-stakes domains (medical, finance), ample budget |
| Parameter-Efficient Fine-Tuning | Updates only a subset of parameters | General tasks, limited budget |

Within the parameter-efficient camp, there are several specific approaches:

| Method | Characteristics | Recommendation |
| --- | --- | --- |
| LoRA | Most stable results, closest to full fine-tuning | Top pick |
| Prompt Tuning | Results can be unstable | Not recommended as first choice |
| Prefix Tuning | A variant of Prompt Tuning | Not recommended as first choice |
| Adapter | Slows down inference | Not recommended as first choice |

How LoRA Works

The core idea behind LoRA (Low-Rank Adaptation) is elegant: freeze the pre-trained model’s original weights and add a pair of small low-rank matrices alongside selected weight matrices to learn the incremental change [2].

Specifically, given a weight matrix W in a model layer, LoRA rewrites it as:

W' = W + ΔW = W + B × A

Here W is taken to be a d-by-d matrix for simplicity, B is a d-by-r matrix, A is an r-by-d matrix, and the rank r is much smaller than d. During training, only A and B are updated while the original weights W stay frozen. This cuts the trainable parameters for that matrix from d² to 2×d×r. With d = 4096 and r = 8, for example, that is 65,536 instead of roughly 16.8 million.
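The formula can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular library implements it: the class name `LoRALinear` and the helper functions are hypothetical, the matrices are tiny nested lists, and the scaling factor used in practice is left out. Following the LoRA paper, A starts with small random values and B starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


class LoRALinear:
    """Toy linear layer computing (W + B A) x with W frozen."""

    def __init__(self, W, r, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weights
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]  # r x d_in
        self.B = [[0.0] * r for _ in range(d_out)]  # d_out x r, zero-initialized

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))  # B (A x), never forms the full d x d update
        return [b + d for b, d in zip(base, delta)]

    def merged_weight(self):
        """Return W' = W + B A, e.g. for merging the adapter after training."""
        BA = matmul(self.B, self.A)
        return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(self.W, BA)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


d, r = 6, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
layer = LoRALinear(W, r)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(layer.forward(x) == matvec(W, x))   # adapter starts as a no-op
print(layer.trainable_params(), d * d)    # 2*d*r versus d^2
```

Because the update can be folded back into W via `merged_weight`, a merged LoRA model adds no extra inference cost, which is one reason it compares well against adapter layers.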

Here’s an analogy: full fine-tuning is like sending an employee back to school for an entire degree. They relearn everything, which is thorough but incredibly expensive. LoRA is like putting them through a two-week intensive workshop focused on job-relevant skills, which is good enough for most situations [3].

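# Example LoRA fine-tuning run using LLaMA-Factory.
# --stage sft and --do_train select supervised fine-tuning and enable training;
# --template picks the prompt format ("default" is a generic one for base models).
llamafactory-cli train \
  --stage sft \
  --do_train \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --template default \
  --finetuning_type lora \
  --lora_rank 8 \
  --lora_target q_proj,v_proj \
  --dataset alpaca_zh \
  --output_dir ./output/lora_model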
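For a rough sense of scale, the trainable parameter count for a configuration like the one above (rank 8 on `q_proj` and `v_proj`) can be estimated with simple arithmetic. The hidden size of 4096 and the 32 layers are assumptions about Llama-2-7B's architecture, and `lora_param_count` is a hypothetical helper, so treat this as a back-of-the-envelope sketch.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


HIDDEN = 4096   # assumed hidden size of Llama-2-7B
LAYERS = 32     # assumed number of transformer layers
TARGETS = 2     # q_proj and v_proj
RANK = 8

per_matrix = lora_param_count(HIDDEN, HIDDEN, RANK)
total = per_matrix * TARGETS * LAYERS
fraction = total / 7e9  # relative to roughly 7 billion base parameters

print(per_matrix)           # 65536
print(total)                # 4194304
print(f"{fraction:.4%}")    # about 0.06% of the base model
```

Targeting more projection matrices or raising the rank scales this number linearly, so even fairly generous settings stay well below 1% of the base model.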

In one sentence: if budget is no constraint, go with full fine-tuning. Otherwise LoRA is the smart choice, and in most cases the results are close.

Three Practical Lessons

Theory aside, let’s look at some battle-tested lessons. The following insights come from real translation model fine-tuning projects, and each one is rather counter-intuitive.

Lesson 1: Fewer Examples Are Better

Most people intuitively assume that putting more examples in the prompt leads to better results. Real-world testing says otherwise:

| Approach | Result |
| --- | --- |
| No examples (zero-shot) | Unstable output |
| 1 example (one-shot) | Stable output, best results |
| Multiple examples (few-shot) | No further improvement, possible degradation |

One example is the sweet spot. More doesn’t help and might even confuse the model. It’s like teaching someone to cook: give them one recipe and they can follow it. Give them ten, and they’ll be paralyzed by choice [4].
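The three setups in the table differ only in how many demonstrations are placed in the prompt. A minimal sketch of that, assuming a Chinese-to-English translation task and a made-up prompt layout (the function name and wording are illustrative, not taken from the project):

```python
def build_translation_prompt(text, demonstrations=()):
    """Build a prompt with zero, one, or several (source, target) demonstrations."""
    lines = ["Translate the following Chinese text into English."]
    for source, target in demonstrations:
        lines.append(f"Chinese: {source}")
        lines.append(f"English: {target}")
    # The actual query always comes last, with the answer left open.
    lines.append(f"Chinese: {text}")
    lines.append("English:")
    return "\n".join(lines)


zero_shot = build_translation_prompt("今天天气很好")
one_shot = build_translation_prompt(
    "今天天气很好",
    demonstrations=[("我喜欢读书", "I like reading.")],
)
print(one_shot)
```

Keeping the demonstration count as a parameter makes it cheap to rerun the same evaluation set at zero, one, and several examples and compare the outputs directly.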

Lesson 2: Data Quality Crushes Data Quantity

This one is even more counter-intuitive. In one translation project, the team filtered 200,000 Chinese-English sentence pairs down to the top 25% by quality (roughly 50,000 pairs). The model trained on just those 50,000 high-quality pairs outperformed the one trained on the full 200,000.

The reason is straightforward: the full dataset contains a large number of low-quality samples. These act as noise, actively misleading the model.

So the first step in any fine-tuning project shouldn’t be “find more data.” It should be “define what good data looks like.” Set quality standards first, then filter ruthlessly. Less is more [5].
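The filtering step described above can be sketched as "score every pair, sort, keep the top quarter." How the project actually scored quality is not stated, so `length_ratio_score` below is a deliberately crude, hypothetical placeholder; in practice the scoring function is where the real work goes.

```python
def keep_top_fraction(pairs, score_fn, fraction=0.25):
    """Return the highest-scoring fraction of the dataset."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(pairs, key=score_fn, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def length_ratio_score(pair):
    """Hypothetical placeholder score: penalize very unbalanced lengths.

    Compares Chinese character count with English word count.
    """
    source, target = pair
    src_len, tgt_len = len(source), len(target.split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


pairs = [
    ("今天天气很好", "The weather is nice today"),
    ("你好", ""),  # empty target: obviously bad
    ("谢谢", "Thank you very much for everything you have done"),
    ("我喜欢读书", "I like reading"),
]
kept = keep_top_fraction(pairs, length_ratio_score, fraction=0.25)
print(kept)
```

With 200,000 pairs and `fraction=0.25`, the same function keeps 50,000, matching the proportions in the project described above.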

Lesson 3: Model Size vs. Cost Trade-Off

Training costs vary dramatically across model sizes:

| Parameters | Hardware | Training Time |
| --- | --- | --- |
| 38B | 128 dedicated chips | 11 days |
| 2.6B | 8x A100 GPUs | 9 days |

Bigger models still perform better, but you have to factor in cost. High-stakes domains (medical, finance) warrant large models with full fine-tuning. For general tasks (customer service, content generation), smaller models with LoRA offer lower costs and faster iteration. You don’t always need to bring in the “superstar.” The right fit matters more.
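One quick way to compare the two rows is to convert them into device-days. This is only a rough sketch: the two runs used different hardware, so a "dedicated chip" day and an A100 day are not equivalent, and the ratio below says nothing about actual cost in money.

```python
def device_days(devices, days):
    """Total accelerator time consumed by a training run."""
    return devices * days


runs = {
    "38B": {"devices": 128, "days": 11},
    "2.6B": {"devices": 8, "days": 9},
}

usage = {name: device_days(**spec) for name, spec in runs.items()}
ratio = usage["38B"] / usage["2.6B"]

print(usage)            # {'38B': 1408, '2.6B': 72}
print(round(ratio, 1))  # 19.6
```

Even with that caveat, a gap of roughly twenty times in accelerator time is large enough to change how often you can afford to iterate.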

A Classic Cross-Lingual Approach

Here’s a scenario that comes up often: your model only understands Chinese and English, but now you need it to handle Thai. Thai data is scarce, and the model’s vocabulary doesn’t contain any Thai tokens. A Thai sentence gets tokenized into a chaotic mess of meaningless fragments, and the model can’t make sense of it at all.

How do you solve this? One proven approach is vocabulary expansion plus incremental fine-tuning, done in four steps:

  1. Train a tokenizer for the target language, so the system can recognize its basic units
  2. Add new language tokens to the existing vocabulary, essentially giving the model a “dictionary” update
  3. Do incremental fine-tuning with mixed data, letting the model build new knowledge on top of what it already knows
  4. Apply LoRA for efficient parameter fine-tuning, completing the adaptation at low cost

The results are dramatic. Before optimization, a sentence gets chopped into meaningless fragments. After, the same sentence is properly tokenized into 21 tokens, and the model can finally “read” it.
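The effect of vocabulary expansion can be reproduced with a toy tokenizer. This sketch assumes a greedy longest-match tokenizer with UTF-8 byte fallback, and it stands in for step 1 by simply handing over a list of new tokens; the function names are hypothetical. New embedding rows are initialized to the mean of the existing ones, a common heuristic rather than something the original project is known to have used. Incremental fine-tuning (steps 3 and 4) is what would then teach those rows something useful.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer with UTF-8 byte fallback (toy model)."""
    max_len = max((len(tok) for tok in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: fall back to one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def expand_vocab(vocab, embeddings, new_tokens):
    """Add new tokens and grow the embedding table to match (step 2)."""
    dim = len(next(iter(embeddings.values())))
    # Mean of the existing rows, computed before any new rows are added.
    mean = [sum(vec[k] for vec in embeddings.values()) / len(embeddings) for k in range(dim)]
    for tok in new_tokens:
        if tok not in vocab:
            vocab.add(tok)
            embeddings[tok] = list(mean)
    return vocab, embeddings


base_vocab = {"hello", " ", "world"}
embeddings = {tok: [0.1 * i, 0.2 * i] for i, tok in enumerate(sorted(base_vocab))}

# The Thai greeting "สวัสดี", written with escapes to keep the code points explicit.
thai_greeting = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"

before = tokenize(thai_greeting, base_vocab)
expand_vocab(base_vocab, embeddings, [thai_greeting])
after = tokenize(thai_greeting, base_vocab)

print(len(before), len(after))  # 18 1
```

Six Thai code points become eighteen byte-level fragments before expansion and a single token afterwards, which is the same kind of improvement described above on a much smaller scale.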

This approach generalizes well beyond language tasks. If you need your model to handle a specific vertical (healthcare, legal), you can add domain-specific terminology to the vocabulary first, then do incremental fine-tuning with domain data. No need to start from scratch. Just build on what the model already knows.

Evaluation: Don’t Rely Solely on Academic Metrics

One last point that’s easy to overlook: how you evaluate matters.

BLEU, ROUGE, and similar academic metrics are useful as reference points, but they only measure surface-level text similarity, not real-world effectiveness. A translation with a high BLEU score might read completely unnaturally. A summary with a high ROUGE score might miss critical information.
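A small experiment shows why overlap-based scores can mislead. The function below is a simplified clipped n-gram precision, the building block of BLEU, not the full metric (there is no brevity penalty and no averaging over n-gram orders), and the sentences are made up for illustration.

```python
from collections import Counter


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision of candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it appears in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


reference = "the meeting was moved to friday"
wrong_meaning = "the meeting was not moved to friday"   # opposite meaning
good_paraphrase = "they rescheduled it for friday"       # same meaning, different words

print(ngram_precision(wrong_meaning, reference))    # 6/7, high despite being wrong
print(ngram_precision(good_paraphrase, reference))  # 0.2, low despite being right
```

The candidate that flips the meaning scores far higher than the faithful paraphrase, which is exactly the failure mode to watch for when a metric only measures shared words.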

What actually works is human evaluation combined with A/B testing. Deploy two models in production and let real users decide which one is better. Results are what matter. User satisfaction is the ultimate benchmark.
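When comparing two models by user feedback, it helps to check that the difference is larger than noise. One simple option, offered here as an assumption rather than the method any particular team used, is a two-proportion z-test on the share of satisfied users. The counts below are invented for illustration.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference between two success rates."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: users who rated the response as helpful.
z, p_value = two_proportion_z_test(success_a=520, total_a=1000,
                                   success_b=580, total_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests model B's higher satisfaction rate is unlikely to be chance, though the test assumes independent users and a single, pre-planned comparison.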


  1. A pre-trained model is essentially an autoregressive language model whose training objective is to maximize the conditional probability P(x_t | x_1, …, x_{t-1}). It’s good at probabilistic text continuation, not at understanding instruction intent.

  2. The LoRA paper, “LoRA: Low-Rank Adaptation of Large Language Models,” was published by Hu et al. in 2021. It reports that small ranks such as r=4 or r=8 achieve results close to full fine-tuning on the tasks it evaluates.

  3. The parameter difference is massive. For a 7B model, full fine-tuning requires training all 7 billion parameters, while LoRA (r=8) typically trains less than 1% of that.

  4. Few-shot learning research reports a related effect, sometimes called “demonstration sensitivity”: models are highly sensitive to which examples are provided and how many.

  5. The “LIMA: Less Is More for Alignment” paper supports this conclusion: a model fine-tuned on only 1,000 carefully curated samples can be competitive with models trained on far larger instruction datasets.