Master Class: Fine-Tuning Microsoft’s Phi-3.5 MoE for Edge Devices

Date: January 3, 2026
Category: Artificial Intelligence / Edge AI
Reading Time: 25 Minutes


1. Introduction: Why “Small” is the New Big

By 2026, the AI hype cycle has shifted. We are no longer impressed by 1-trillion-parameter models that require a nuclear power plant to run. The real engineering challenge, and the real money, is in Edge AI.

Companies want intelligence that runs on a doctor’s laptop, a factory’s control unit, or a lawyer’s secure on-prem server. They want no network round-trips, no data leaving the premises, and no recurring cloud costs.

This brings us to Phi-3.5 MoE (Mixture of Experts). Released by Microsoft in August 2024, it remains the gold standard for “punching above its weight.”

  • Total Parameters: 42 Billion (The knowledge capacity)
  • Active Parameters: 6.6 Billion (The inference cost)
  • Context Window: 128k Tokens

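Because it only activates 6.6B parameters per token, its per-token compute is close to that of a 7B dense model, which is what makes consumer hardware feasible. All 42B parameters must still sit in memory, though, which is why quantization matters later in this guide. And because it has 42B parameters of knowledge, it reasons like a much larger model. In this guide, we will fine-tune this beast on a custom dataset and deploy it to a standard laptop.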
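A quick back-of-the-envelope check makes the memory versus compute split concrete. This is a rough sketch that counts weights only (no KV cache, activations, or quantization metadata), and the helper name `weight_gb` is purely illustrative.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


TOTAL_B = 42.0   # total parameters in billions (from the spec list above)
ACTIVE_B = 6.6   # parameters used per token, in billions

# Every expert has to be resident in memory, even though few run per token.
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: about {weight_gb(TOTAL_B, bits):.0f} GB of weights")

# Compute per token scales with the active share, not the total.
print(f"active share per token: {ACTIVE_B / TOTAL_B:.1%}")
```

The takeaway is that MoE saves compute, not storage: the 4-bit figure is what decides whether the model fits on a given device.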


2. The Architecture: Understanding MoE

Before we write code, you must understand what you are training.

A standard “Dense” model (like Llama-3-8B) uses 100% of its neural network for every single word it generates. This is inefficient. Does the model really need its “French Poetry” neurons to answer a Python coding question?

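Mixture of Experts (MoE) solves this. Phi-3.5 MoE is composed of 16 distinct “Expert” neural networks. A “Router” sits in front of them and picks two experts for each token. The walkthrough that follows is a simplification: real routing happens per token inside every MoE layer, and experts rarely map to clean human topics such as “coding.”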

  1. Input: “How do I reverse a list in Python?”
  2. Router: “This is a coding task. I will activate Expert #4 (Coding) and Expert #9 (Logic).”
  3. Inference: Only those two experts run. The other 14 stay asleep.
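The three steps above can be sketched in a few lines of plain Python. This is a generic softmax top-2 gate, assumed here for illustration only; Phi-3.5 MoE's actual routing function may differ in detail, and the toy "experts" are just scaling functions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top2(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # renormalize so the two weights sum to 1
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, experts, router_logits):
    """Mix the outputs of the two selected experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in route_top2(router_logits))


# Toy setup: 16 "experts", each a simple scaling function (illustrative only).
experts = [lambda x, k=k: x * (k + 1) for k in range(16)]
logits = [0.0] * 16
logits[4], logits[9] = 2.0, 1.0  # pretend the router favors experts 4 and 9

print(route_top2(logits))
print(moe_forward(1.0, experts, logits))
```

Only two of the sixteen callables are invoked per input, which is where the inference savings come from.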

The Fine-Tuning Challenge: When we fine-tune an MoE, we must ensure we don’t break the router. If the router forgets how to pick the right expert, the model becomes lobotomized.


3. Prerequisites & Hardware

You do not need an H100 cluster for this. Thanks to the Unsloth library, we can fit this into surprisingly small GPUs.

Training Hardware (Minimum)

  • GPU: NVIDIA RTX 3090 / 4090 (24GB VRAM) OR A10G (24GB, Cloud). A 16GB T4 is too small: the 4-bit weights alone come to roughly 21GB.
  • RAM: 32GB System RAM.
  • Disk: 100GB NVMe (The base model is large).

Inference Hardware (Target Device)

  • MacBook Pro: M2/M3 Max (32GB+ RAM).
  • NVIDIA Jetson Orin: (For robotics).
  • Gaming Laptop: RTX 4060 or higher (8GB of laptop VRAM cannot hold the full model, so expect most layers to run from system RAM).

4. Step 1: The Environment (Unsloth)

We will use Unsloth. Its appeal here is memory-efficient QLoRA (Quantized Low-Rank Adaptation) without memory spikes. MoE support varies between releases, so confirm that the version you install lists Phi-3.5 MoE as supported before you start.

# 1. Create a fresh Conda environment
conda create --name phi-moe python=3.11 -y
conda activate phi-moe

# 2. Install PyTorch (Check your CUDA version!)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 3. Install Unsloth and Xformers
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

5. Step 2: Data Preparation

MoE models are extremely sensitive to data quality. “Garbage in, Garbage out” applies doubly here. We will use the Alpaca format (instruction, optional input, output), and you must ensure your instructions are clear and unambiguous.

Let’s assume we are training a “Medical Diagnosis Assistant.”

[
  {
    "instruction": "Patient presents with sharp pain in the lower right abdomen, fever, and nausea. Provide a differential diagnosis.",
    "input": "",
    "output": "Based on the clinical presentation of right lower quadrant pain, fever, and nausea, the primary differential diagnosis is Acute Appendicitis. Other considerations include: \n1. Kidney Stones (Nephrolithiasis)\n2. Ovarian Cyst rupture (if female)\n3. Gastroenteritis."
  },
  {
    "instruction": "Interpret the following blood test results.",
    "input": "WBC: 18,000/uL, Hemoglobin: 14 g/dL",
    "output": "The WBC count is elevated (Leukocytosis), suggesting an active infection or inflammation. Hemoglobin is within normal limits."
  }
]

Save this as medical_dataset.json.
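Before training, it can be worth checking the file mechanically. The sketch below is one possible validator, not part of Unsloth or any library; `validate_records` is a hypothetical helper that only checks that the three Alpaca keys exist and that instruction and output are non-empty.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_records(records):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for idx, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {idx}: not a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(rec.get(key), str):
                problems.append(f"record {idx}: '{key}' is missing or not a string")
        # "input" may legitimately be empty; instruction and output may not.
        for key in ("instruction", "output"):
            value = rec.get(key)
            if isinstance(value, str) and not value.strip():
                problems.append(f"record {idx}: '{key}' is empty")
    return problems


if __name__ == "__main__":
    path = Path("medical_dataset.json")
    if path.exists():
        issues = validate_records(json.loads(path.read_text(encoding="utf-8")))
        print("\n".join(issues) if issues else "dataset looks OK")
    else:
        print(f"{path} not found")
```

A structural check like this says nothing about medical accuracy; it only catches malformed records before they reach the trainer.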


6. Step 3: The Training Script

Create a file named train_phi_moe.py. We will use 4-bit quantization to fit the 42B parameters into memory.

A. Load Model

from unsloth import FastLanguageModel
import torch

max_seq_length = 4096 # Supports up to 128k, but OOM risk increases
dtype = None # Auto detection
load_in_4bit = True # Essential for consumer GPUs

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3.5-MoE-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

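# Add LoRA adapters
# These are the usual attention/MLP projection names. Expert layers may be
# named differently in this architecture, so compare this list against
# model.named_modules() to confirm the experts are actually being tuned.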
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0, # Optimized = 0
    bias = "none",    # Optimized = "none"
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
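Since `target_modules` only has an effect when the names match real layers, it helps to list what the loaded model actually contains. The helper below is a small sketch: `linear_leaf_names` is a hypothetical function, and the demo uses made-up module paths and stand-in classes so it runs without a GPU. On a real model, call it right after `from_pretrained`, before the adapters are attached.

```python
def linear_leaf_names(named_modules):
    """Collect the last path component of every module whose class name contains 'Linear'."""
    names = set()
    for full_name, module in named_modules:
        if "Linear" in type(module).__name__:
            names.add(full_name.rsplit(".", 1)[-1])
    return sorted(names)


# Stand-in classes and invented paths, only to show the output shape.
class Linear4bit:
    pass


class LayerNorm:
    pass


demo = [
    ("model.layers.0.self_attn.q_proj", Linear4bit()),
    ("model.layers.0.self_attn.o_proj", Linear4bit()),
    ("model.layers.0.input_layernorm", LayerNorm()),
]
print(linear_leaf_names(demo))

# With a real model (not run here):
#   print(linear_leaf_names(model.named_modules()))
```

Any name in `target_modules` that does not appear in this output is silently matching nothing. Conversely, if the router shows up as a linear layer, leaving it out of the list keeps it frozen.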

B. Format Data

from datasets import load_dataset

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # "input_text" avoids shadowing Python's built-in input()
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts, }

dataset = load_dataset("json", data_files="medical_dataset.json", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True)

C. Train

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # Increase for real training (e.g., 500-1000)
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer.train()
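It is easy to lose track of how much data these settings consume. The arithmetic below is a simple sketch using the values from the script; `training_budget` is an illustrative helper, and a single GPU is assumed.

```python
def training_budget(per_device_batch, grad_accum, max_steps, num_devices=1):
    """Return (effective batch size, total examples consumed over the run)."""
    effective = per_device_batch * grad_accum * num_devices
    return effective, effective * max_steps


effective_batch, total_examples = training_budget(
    per_device_batch=2, grad_accum=4, max_steps=60
)
print(f"effective batch size: {effective_batch}")        # 2 * 4 = 8
print(f"examples consumed in 60 steps: {total_examples}")  # 8 * 60 = 480

# Passes over the data for a given dataset size.
for dataset_size in (2, 480, 5000):
    print(f"{dataset_size:>5} records -> {total_examples / dataset_size:.2f} epochs")
```

With only the two sample records, 60 steps means 240 passes over the same data, which is pure memorization. Size `max_steps` against the real dataset.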

7. Step 4: Quantization & Export (The Edge Step)

This is where most tutorials stop, but for Edge AI, this is just the beginning. We cannot deploy the raw LoRA adapters easily. We need to merge them and convert the model to GGUF format.

GGUF is the single-file binary format used by llama.cpp. It runs on CPU and on GPU backends such as Apple Metal and CUDA.

# Merge LoRA and Save to GGUF
# quantization_method options: "q4_k_m", "q5_k_m", "q8_0"

model.save_pretrained_gguf("model_q4_k_m", tokenizer, quantization_method = "q4_k_m")
print("GGUF Saved!")

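This process writes a single .gguf file, referred to here as model_q4_k_m.gguf. Depending on the Unsloth version, the file may be written inside a model_q4_k_m/ directory under a different name, so check the output path. The file contains your entire fine-tuned brain, compressed to roughly 24GB at q4_k_m; q5_k_m and q8_0 produce larger files.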
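Because the exact output location can vary, a small script to locate the exported file and report its size is handy before copying anything to the target device. This is a sketch using only the standard library; `find_gguf_files` is a hypothetical helper, and the directory name is the one passed to the export call above.

```python
from pathlib import Path


def find_gguf_files(output_dir):
    """Return all .gguf files under output_dir, largest first (empty list if none)."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.gguf"), key=lambda p: p.stat().st_size, reverse=True)


if __name__ == "__main__":
    found = find_gguf_files("model_q4_k_m")
    if not found:
        print("no .gguf files found; check the export logs for the output path")
    for path in found:
        print(f"{path}  {path.stat().st_size / 1e9:.1f} GB")
```

Comparing the reported size with the free memory on the target device is the quickest way to know whether deployment is realistic.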


8. Step 5: Running on the Edge (Llama.cpp)

Now, transfer that .gguf file to your MacBook or Edge Server. You don’t need Python or PyTorch anymore. You just need llama.cpp.

MacBook / Linux Terminal:

# 1. Download and build llama.cpp (current releases build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# 2. Run the model (the CLI binary is llama-cli; older checkouts called it ./main)
./build/bin/llama-cli -m ./model_q4_k_m.gguf \
  -n 512 \
  -p "### Instruction: Diagnosing a patient with fever. ### Response:"
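A fine-tuned model tends to respond best when the inference prompt matches the training template exactly, including the preamble and line breaks. The sketch below rebuilds the Alpaca template from the training script and writes it to a file; `build_inference_prompt` and the `prompt.txt` filename are illustrative choices, not part of llama.cpp.

```python
from pathlib import Path

# Same template as in the training script, reproduced so this file is standalone.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def build_inference_prompt(instruction, input_text=""):
    """Fill the training template but leave the response slot empty for the model."""
    return ALPACA_PROMPT.format(instruction, input_text, "")


if __name__ == "__main__":
    prompt = build_inference_prompt(
        "Patient presents with fever and a persistent cough. Provide a differential diagnosis."
    )
    Path("prompt.txt").write_text(prompt, encoding="utf-8")
    print(prompt)
```

The resulting file can then be passed to the CLI with its prompt-file option (`-f prompt.txt`) instead of an inline `-p` string. Flags change between llama.cpp releases, so confirm with `--help` on your build.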

Why does this matter?

On an M3 Max MacBook, this setup should run at roughly 30-40 tokens per second, depending on quantization level and context length. That is faster than reading speed. You have a medically fine-tuned, reasoning-heavy AI running offline, with no data leaving the device.


9. Conclusion

Fine-tuning Phi-3.5 MoE is the perfect middle ground for 2026. It offers the reasoning capability of a 40B+ model with roughly the per-token compute cost of a 7B model, as long as the device has enough memory to hold all the experts. By combining Unsloth for training and GGUF for deployment, you bridge the gap between “Research Demo” and “Product.”


Author update

I will keep this post updated as new results or tools appear. If you want a deeper dive on any section, tell me what to prioritize.
