Stop Rewriting Code! mlx-tune Brings Unsloth to Apple Silicon
What if every time you wanted to test a fine-tuning idea, you had to burn $5 on cloud GPUs before knowing if it even worked? What if your Mac—sitting right there with 36GB, 64GB, even 512GB of unified memory—was basically a paperweight for LLM training?
Here's the dirty secret most developers won't admit: they're trapped in a two-codebase nightmare. Write training scripts for CUDA. Rewrite them for local testing. Debug on cloud instances at 3 AM. Repeat until your AWS bill triggers an existential crisis.
But what if your MacBook could run the exact same code as your cloud GPUs? No rewrites. No context switching. No "works on my machine" except it actually does.
Enter mlx-tune—the bridge that Apple Silicon users have been desperately waiting for. Built by a developer who was sick of the friction, this unofficial but brilliant wrapper around Apple's MLX framework gives you an Unsloth-compatible API that runs natively on M1/M2/M3/M4/M5 chips. Same code. Same workflow. Different silicon.
The implications are massive. Prototype on your MacBook Air during your commute. Push the identical script to an A100 cluster for production training. That's not convenience—that's code portability as a superpower.
Ready to stop paying the cloud tax for every experiment? Let's dive deep into why mlx-tune is about to become your most-used development tool.
What is mlx-tune?
mlx-tune is an open-source fine-tuning framework that wraps Apple's native MLX machine learning framework in an API that mirrors Unsloth—the gold standard for efficient LLM fine-tuning on NVIDIA GPUs.
Created by ARahim3, a developer who found himself switching between a MacBook M4 for daily work and cloud GPUs for training, mlx-tune solves what he calls the "Context Switch" problem. Unsloth relies on Triton kernels, which don't exist on macOS. Rather than maintaining two entirely different codebases, he built a compatibility layer that lets you write FastLanguageModel code once and run it anywhere.
Critical distinction: This is NOT trying to beat Unsloth on speed benchmarks. It's solving a workflow problem, not a performance problem. The goal is seamless portability between local Mac development and cloud GPU production.
The project started as unsloth-mlx but was renamed to avoid confusion with the official Unsloth project. At v0.4.25, it's already a mature ecosystem supporting:
- SFT, DPO, ORPO, GRPO, KTO, SimPO training methods
- Vision-Language Models (Gemma 4, Qwen3.5, PaliGemma, LLaVA, Pixtral)
- Audio models (5 TTS architectures, 7 STT architectures including streaming)
- Embedding models with contrastive learning
- OCR models with built-in CER/WER evaluation
- MoE architectures (39+ models, 128 experts)
- Continual pretraining with decoupled learning rates
The repository has gained significant traction on GitHub with thousands of downloads via PyPI, and it's actively maintained with weekly updates pushing new model support.
Key Features That Make mlx-tune Insane
Let's break down what makes this tool genuinely transformative for Mac-based ML engineers:
🔄 True Unsloth API Compatibility
The killer feature. Import FastLanguageModel from mlx_tune instead of unsloth, and your training script runs unchanged. Same parameters. Same methods. Same mental model. This isn't "inspired by" Unsloth—it's a deliberate API clone that prioritizes drop-in replacement over creative rebranding.
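One way to keep a single training script for both environments is to resolve the backend at import time. The sketch below is an assumption about how you might wire this up, not code from the mlx-tune repository: `pick_backend` is a hypothetical helper, and it only checks which of the two package names (`unsloth`, `mlx_tune`) is installed.

```python
import importlib.util


def pick_backend(candidates=("unsloth", "mlx_tune")):
    """Return the first installed package name from `candidates`, or None.

    Hypothetical helper: find_spec only looks the package up, it does not import it.
    """
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = pick_backend()
if backend is None:
    print("Neither unsloth nor mlx_tune is installed.")
else:
    print(f"This script would import FastLanguageModel from: {backend}")
```

With the backend name in hand, `importlib.import_module(backend)` would give the script one module object to pull `FastLanguageModel` from on either platform.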
🧠 Unified Memory Exploitation
Apple Silicon's shared memory architecture means your RAM is your VRAM. On a Mac Studio with 512GB unified memory, you can load models that would require multiple A100s on CUDA. mlx-tune leverages MLX's memory-efficient kernels to maximize this advantage—no PCIe bottlenecks, no out-of-memory errors from fragmented GPU allocations.
🎯 Multi-Modal Native Support
Most "LLM fine-tuning" tools are actually just text tools. mlx-tune goes far beyond:
- Vision: Full VLM fine-tuning via mlx-vlm integration
- Audio: TTS (Orpheus, OuteTTS, Spark, Sesame, Qwen3-TTS) and STT (Whisper, Moonshine, Qwen3-ASR, NVIDIA Canary, Voxtral, Parakeet TDT)
- Embeddings: BERT, ModernBERT, Qwen3-Embedding, Harrier with InfoNCE loss
- OCR: DeepSeek-OCR, GLM-OCR, olmOCR with character-level metrics
🏗️ MoE Architecture Mastery
Mixture of Experts models are notoriously tricky to fine-tune efficiently. mlx-tune auto-detects MoE layers and applies per-expert LoRA via LoRASwitchLinear—supporting 39+ architectures including Arcee Trinity-Nano's staggering 128 experts plus shared expert.
📦 Flexible Export Pipeline
Train locally, deploy anywhere. Save as HuggingFace format, merge LoRA weights into full models, or convert to GGUF for Ollama/llama.cpp inference. The convert() utility handles HF → MLX conversion for LLMs, TTS, and STT models.
⚡ Advanced Training Methods
Beyond basic SFT, you get production-grade RL methods: GRPO for reasoning (DeepSeek R1-style), DPO with proper log-probability loss, ORPO's combined approach, KTO for binary feedback, and SimPO without reference models.
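To make "log-probability loss" concrete, here is a minimal sketch of the standard DPO objective as it is usually written in the literature. It is not taken from mlx-tune's source; the function name `dpo_loss` and its arguments are illustrative, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers "chosen" over "rejected" than the reference does
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss sits at log(2)
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has moved toward the chosen answer: the loss drops
print(dpo_loss(-3.0, -7.0, -5.0, -5.0))
```

SimPO, by contrast, drops the reference terms entirely, which is why it needs no reference model.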
Real-World Use Cases Where mlx-tune Dominates
1. The Frugal Prototyper
You're a solo developer with a MacBook Pro M3 (36GB). You want to fine-tune a 7B model for a customer support chatbot. Cloud GPU costs would eat your runway before you validate the concept. With mlx-tune, you run 50-step experiments locally, iterate on prompts and data formatting, then scale to cloud for the final 10,000-step training run. The experimentation phase incurs no cloud GPU charges at all; you pay only for the final run.
2. The Multi-Modal Product Team
Your startup needs document OCR with custom formatting recognition. Traditional pipeline: train text model on cloud, train vision model separately, glue together with fragile middleware. With mlx-tune's FastOCRModel and built-in CER/WER metrics, you fine-tune DeepSeek-OCR or GLM-OCR end-to-end on your Mac, evaluating character-level accuracy in real-time.
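For readers unfamiliar with the two metrics, CER and WER are both edit distances normalized by the reference length, computed over characters and words respectively. The sketch below shows the usual definitions in plain Python; it is not mlx-tune's built-in evaluator, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


print(cer("total due 42", "total due 47"), wer("total due 42", "total due 47"))
```

A single wrong digit barely moves CER but costs a full word in WER, which is why OCR work typically reports both.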
3. The Voice AI Innovator
Building a personalized TTS assistant? mlx-tune supports five distinct TTS architectures with automatic codec detection. Fine-tune Orpheus-3B for emotional speech, or Spark-TTS for zero-shot voice cloning—all on Apple Silicon. The TTSDataCollator handles sampling rate normalization automatically.
4. The Embedding Specialist
Semantic search for a niche domain (legal, medical, technical)? Standard embeddings fail on jargon. Use FastEmbeddingModel with InfoNCE contrastive loss to fine-tune Qwen3-Embedding or Harrier on your proprietary document pairs. The EmbeddingDataCollator generates in-batch negatives automatically—no manual negative mining required.
5. The MoE Researcher
Experimenting with Mixture of Experts for efficient serving? Arcee Trinity-Nano's 128 experts would be a configuration nightmare on most frameworks. mlx-tune auto-detects the architecture, applies per-expert LoRA, and even supports continual pretraining with decoupled learning rates for embeddings vs. transformer layers.
Step-by-Step Installation & Setup Guide
Getting started takes under five minutes. Here's the complete setup:
Prerequisites
- Hardware: Apple Silicon Mac (M1/M2/M3/M4/M5)
- OS: macOS 13.0 or later
- Memory: 8GB+ unified RAM (16GB+ strongly recommended)
- Python: 3.9 or newer
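If you want to confirm these prerequisites before installing, a short standard-library check along the following lines should do. It is a sketch, not part of mlx-tune: `check_prerequisites` is a hypothetical helper, and it skips the memory check because the standard library has no portable way to read installed RAM.

```python
import platform
import sys


def check_prerequisites():
    """Report whether this machine meets the hardware, OS, and Python requirements."""
    mac_version = platform.mac_ver()[0]  # empty string when not running on macOS
    major = int(mac_version.split(".")[0]) if mac_version else 0
    return {
        "apple_silicon": platform.system() == "Darwin" and platform.machine() == "arm64",
        "macos_13_or_later": major >= 13,
        "python_3_9_or_newer": sys.version_info >= (3, 9),
    }


for requirement, ok in check_prerequisites().items():
    print(f"{requirement}: {'OK' if ok else 'not met'}")
```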
Installation Commands
# RECOMMENDED: Using uv (faster, more reliable dependency resolution)
uv pip install mlx-tune
# With audio support for TTS/STT fine-tuning
uv pip install 'mlx-tune[audio]'
brew install ffmpeg # Required system dependency for audio codecs
# Alternative: Standard pip
pip install mlx-tune
# Development install from source
git clone https://github.com/ARahim3/mlx-tune.git
cd mlx-tune
uv pip install -e .
Verification
# Quick import test
from mlx_tune import FastLanguageModel, SFTTrainer
print("mlx-tune loaded successfully!")
# Check MLX backend
import mlx.core as mx
print(f"MLX version: {mx.__version__}")
print(f"Default device: {mx.default_device()}")
Environment Optimization
For maximum performance on Apple Silicon:
# Inspect the current CPU power state (powermetrics only reports; it does not change any setting)
sudo powermetrics --samplers cpu_power -n 1
# Set environment variables for MLX
export MLX_CPU_COUNT=8 # Adjust to your core count
export MLX_METAL_DEVICE_WRAPPER_TYPE=1 # Enable Metal debugging if needed
Dependency Conflicts (Important!)
If using DeepSeek-OCR models, note the transformers version constraint:
# DeepSeek-OCR requires transformers < 5.0
uv pip install 'transformers>=4.45,<5.0' 'mlx-lm<0.31' 'mlx-vlm<0.4'
uv pip install mlx-tune --no-deps # Prevent dependency override
REAL Code Examples from the Repository
Let's examine actual code patterns from mlx-tune's documentation, with detailed explanations of how each works.
Example 1: Basic SFT Fine-Tuning Pipeline
This is the bread-and-butter workflow—notice how identical it is to Unsloth:
from mlx_tune import FastLanguageModel, SFTTrainer, SFTConfig
from datasets import load_dataset
# Load any HuggingFace model, quantized or full precision
# The 4-bit quantization dramatically reduces memory for quick experiments
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="mlx-community/Llama-3.2-1B-Instruct-4bit",
max_seq_length=2048, # Context window for training
load_in_4bit=True, # QLoRA: 4-bit base + 16-bit LoRA adapters
)
# Add LoRA adapters to attention layers
# r=16 gives good quality; increase to 32-64 for complex tasks
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank: controls adapter capacity
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Attention only
lora_alpha=16, # Scaling factor: typically equal to r
)
# Load dataset—using small slice for quick validation
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:100]")
# SFTTrainer API matches TRL exactly—zero learning curve
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
tokenizer=tokenizer,
args=SFTConfig(
output_dir="outputs",
per_device_train_batch_size=2,
learning_rate=2e-4, # Standard LoRA LR; 10x lower for full fine-tuning
max_steps=50, # Quick test; production: 1000-10000
),
)
trainer.train()
# Export options: adapters only, merged model, or GGUF for Ollama
model.save_pretrained("lora_model") # Smallest: just adapters
model.save_pretrained_merged("merged", tokenizer) # Full model for HF
model.save_pretrained_gguf("model", tokenizer) # For llama.cpp/Ollama
Key insight: The load_in_4bit=True flag enables QLoRA, where the frozen base model stays in 4-bit while only the small LoRA adapters train in 16-bit. This lets you fine-tune 70B models on 48GB Macs.
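To see why the adapters are "small", count their parameters. For a weight matrix of shape (d_out, d_in), LoRA adds two matrices of shapes (r, d_in) and (d_out, r), so r * (d_in + d_out) trainable values. The sketch below applies that formula; the layer shapes and layer count are assumptions chosen for illustration, so check your model's config for the real values.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    return r * (d_in + d_out)


def adapter_params(layer_shapes, num_layers, r):
    """Total LoRA parameters when every listed module is adapted in every layer."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


# Assumed (d_in, d_out) shapes for a small model with grouped-query attention
shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
}
total = adapter_params(shapes, num_layers=16, r=16)
print(f"{total:,} trainable parameters, about {total * 2 / 1e6:.1f} MB at 16-bit")
```

Doubling `r` doubles the adapter size, which is the trade-off behind the "increase to 32-64 for complex tasks" advice in the example.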
Example 2: Vision Model Fine-Tuning
Fine-tuning VLMs for image understanding tasks:
from mlx_tune import FastVisionModel, UnslothVisionDataCollator, VLMSFTTrainer
from mlx_tune.vlm import VLMSFTConfig
# Load vision-language model—auto-detects processor type
model, processor = FastVisionModel.from_pretrained(
"mlx-community/Qwen3.5-0.8B-bf16",
)
# Configure which components to train
# Vision layers frozen = faster training, less memory
# Language layers train = adapt to your task
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers=True, # Set False to freeze image encoder
finetune_language_layers=True, # Always train text decoder for task
r=16, lora_alpha=16,
)
# Required: enable training mode for gradient computation
FastVisionModel.for_training(model)
# UnslothVisionDataCollator handles image preprocessing automatically
trainer = VLMSFTTrainer(
model=model,
tokenizer=processor, # VLM uses processor, not just tokenizer
data_collator=UnslothVisionDataCollator(model, processor),
train_dataset=dataset, # Format: {"image": path, "conversations": [...]}
args=VLMSFTConfig(max_steps=30, learning_rate=2e-4),
)
trainer.train()
Critical detail: The UnslothVisionDataCollator is essential—it handles the complex multi-modal batching that standard data collators can't manage. Without it, image tensors and text tokens won't align properly.
Example 3: TTS Fine-Tuning with Auto-Detection
Text-to-speech fine-tuning that automatically handles model-specific quirks:
from mlx_tune import FastTTSModel, TTSSFTTrainer, TTSSFTConfig, TTSDataCollator
from datasets import load_dataset, Audio
# Auto-detects: model architecture, audio codec, token format, sampling rate
# Works with: Orpheus (SNAC), OuteTTS (DAC), Spark-TTS (BiCodec), etc.
model, tokenizer = FastTTSModel.from_pretrained(
"mlx-community/orpheus-3b-0.1-ft-bf16"
)
# Standard LoRA configuration
model = FastTTSModel.get_peft_model(model, r=16, lora_alpha=16)
# Dataset must match model's expected sampling rate
# Orpheus uses 24kHz; Spark-TTS uses 16kHz
dataset = load_dataset("MrDragonFox/Elise", split="train[:100]")
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
# TTSDataCollator handles codec-specific audio tokenization
trainer = TTSSFTTrainer(
model=model,
tokenizer=tokenizer,
data_collator=TTSDataCollator(model, tokenizer), # Auto-configured
train_dataset=dataset,
args=TTSSFTConfig(
output_dir="./tts_output",
max_steps=60, # TTS converges faster than LLMs
),
)
trainer.train()
Pro tip: The TTSDataCollator is doing heavy lifting behind the scenes—converting waveform to discrete tokens via the model's codec (SNAC for Orpheus, DAC for OuteTTS, BiCodec for Spark). Without this abstraction, you'd need to manually implement each codec's forward transform.
Example 4: Embedding Fine-Tuning with Contrastive Loss
Building domain-specific semantic search:
from mlx_tune import FastEmbeddingModel, EmbeddingSFTTrainer
from mlx_tune import EmbeddingSFTConfig, EmbeddingDataCollator
# Pooling strategy critical for task: mean for similarity, cls for classification
model, tokenizer = FastEmbeddingModel.from_pretrained(
"mlx-community/all-MiniLM-L6-v2-bf16",
pooling_strategy="mean", # Options: "mean", "cls", "last_token"
)
model = FastEmbeddingModel.get_peft_model(model, r=16, lora_alpha=16)
# In-batch negatives: each batch's other positives become negatives
# No manual negative mining needed!
trainer = EmbeddingSFTTrainer(
model=model,
tokenizer=tokenizer,
data_collator=EmbeddingDataCollator(model, tokenizer),
train_dataset=[
{"anchor": "How do I reset my password?",
"positive": "Password reset instructions for your account"},
{"anchor": "Shipping time to Germany",
"positive": "International delivery takes 5-7 business days"},
# ... more pairs
],
args=EmbeddingSFTConfig(
loss_type="infonce", # InfoNCE: standard contrastive loss
temperature=0.05, # Lower = sharper similarity discrimination
per_device_train_batch_size=32, # Larger batches = more negatives
max_steps=50,
),
)
trainer.train()
# Inference: encode and compute similarity
embeddings = model.encode(["Hello world", "Hi there"])
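# A plain dot product equals cosine similarity only for L2-normalized vectors,
# so divide by both norms explicitly
dot = (embeddings[0] * embeddings[1]).sum().item()
norm_0 = (embeddings[0] * embeddings[0]).sum().item() ** 0.5
norm_1 = (embeddings[1] * embeddings[1]).sum().item() ** 0.5
similarity = dot / (norm_0 * norm_1)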
print(f"Cosine similarity: {similarity:.3f}")
Why this matters: Traditional embedding fine-tuning requires hard negative mining—finding similar but wrong examples. InfoNCE with in-batch negatives uses the other positives in each batch as implicit negatives, dramatically simplifying data preparation.
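The in-batch negatives idea fits in a few lines of plain Python. This is a sketch of the standard InfoNCE formulation, not mlx-tune's implementation: each anchor is scored against every positive in the batch, and the loss is the cross-entropy of picking its own positive. The function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss; positives[j] with j != i act as negatives for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]]))  # pairs aligned: loss near zero
print(info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]]))  # pairs swapped: large loss
```

Because every other row in the batch is a negative, a batch of 32 gives each anchor 31 negatives for free, which is the reason larger batches help here.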
Example 5: MoE Fine-Tuning (Zero Configuration)
Mixture of Experts models that configure themselves:
from mlx_tune import FastLanguageModel, SFTTrainer, SFTConfig
# Load MoE model—same API as dense models!
# Qwen3.5-35B-A3B: 35B total params, 3B active per token
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="mlx-community/Qwen3.5-35B-A3B-4bit",
max_seq_length=2048,
load_in_4bit=True, # Critical: 35B model needs quantization
)
# Same target_modules—MoE paths resolved automatically
# Prints: "MoE architecture detected — LoRA will target expert layers"
model = FastLanguageModel.get_peft_model(
model, r=8, # Lower rank for MoE (more parameters total)
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
# Training proceeds identically to dense models
# Behind the scenes: LoRASwitchLinear wraps each expert
The magic: mlx-tune detects the model's SwitchLinear layers (the stacked expert projections inside an MoE block) and wraps them in LoRASwitchLinear, which keeps a separate LoRA adapter for each expert. Without this, you'd need to manually configure 128+ expert adapters for Trinity-Nano.
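Conceptually, per-expert LoRA means each expert keeps its own pair of low-rank matrices, and only the adapter of the routed expert contributes to the output. The toy class below illustrates that idea with plain lists; `PerExpertLoRA` is hypothetical and is not how LoRASwitchLinear is actually implemented.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class PerExpertLoRA:
    """Toy MoE projection: frozen expert weights plus one LoRA (A, B) pair per expert."""

    def __init__(self, expert_weights, r, scale=1.0):
        self.expert_weights = expert_weights  # frozen, shape [experts][d_out][d_in]
        d_out, d_in = len(expert_weights[0]), len(expert_weights[0][0])
        self.scale = scale
        # A gets a small constant here (random in practice); B starts at zero so
        # every adapter is a no-op before training
        self.A = [[[0.01] * d_in for _ in range(r)] for _ in expert_weights]
        self.B = [[[0.0] * r for _ in range(d_out)] for _ in expert_weights]

    def forward(self, x, expert_index):
        base = matvec(self.expert_weights[expert_index], x)
        delta = matvec(self.B[expert_index], matvec(self.A[expert_index], x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = PerExpertLoRA([[[1, 0], [0, 1]], [[2, 0], [0, 2]]], r=2)
print(layer.forward([1, 2], expert_index=0))  # matches the frozen expert at init
layer.B[1][0][0] = 1.0                        # pretend training updated expert 1 only
print(layer.forward([1, 2], expert_index=1))
```

Updating expert 1's adapter leaves expert 0 untouched, which is the property that lets each expert specialize independently.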
Advanced Usage & Best Practices
Memory Optimization Strategies
# For maximum model size on limited RAM
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="mlx-community/Mistral-7B-Instruct-v0.2-4bit",
max_seq_length=1024, # Reduce context window
load_in_4bit=True,
# Additional: set gradient checkpointing in SFTConfig
)
# In SFTConfig:
# gradient_checkpointing=True # Trade compute for memory
# per_device_train_batch_size=1 # Minimum viable batch
Response-Only Training Efficiency
Don't waste gradients on prompt tokens:
from mlx_tune import get_chat_template, train_on_responses_only
# Apply template (auto-detects from model name)
tokenizer = get_chat_template(tokenizer, chat_template="auto")
# Only compute loss on assistant responses
# Dramatically faster convergence for instruction tuning
trainer = train_on_responses_only(
trainer,
instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
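Under the hood, response-only training comes down to a loss mask: prompt positions get weight 0 and assistant positions get weight 1. The sketch below shows the idea on a list of string tokens. It is a simplification and not mlx-tune's code, since it treats each marker as a single token whereas real templates span several token IDs; both function names are illustrative.

```python
def response_only_mask(tokens, instruction_marker, response_marker):
    """Return 1 for tokens inside an assistant response, 0 everywhere else."""
    mask = []
    in_response = False
    for token in tokens:
        if token == response_marker:
            in_response = True
            mask.append(0)  # the marker itself is not a training target
        elif token == instruction_marker:
            in_response = False
            mask.append(0)
        else:
            mask.append(1 if in_response else 0)
    return mask


def masked_mean(token_losses, mask):
    """Average per-token losses over the positions the mask keeps."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(token_losses, mask)) / kept if kept else 0.0


tokens = ["<user>", "Hi", "<assistant>", "Hello", "!", "<user>", "Bye", "<assistant>", "Bye"]
print(response_only_mask(tokens, "<user>", "<assistant>"))
```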
Multi-Turn Conversation Extension
Merge multiple short conversations into training examples:
from mlx_tune import to_sharegpt, conversation_extension
# Convert to ShareGPT format, then extend conversations
# Packing several exchanges per example trains the model on longer, multi-turn contexts
extended = conversation_extension(
dataset,
num_turns=3, # Pack 3 exchanges per example
)
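As a rough picture of what "packing 3 exchanges per example" means, the sketch below concatenates consecutive single-exchange conversations into multi-turn ones. It is a deterministic simplification written for illustration; the library's conversation_extension may select and combine rows differently, and `extend_conversations` is a hypothetical name.

```python
def extend_conversations(conversations, num_turns=3):
    """Merge every `num_turns` consecutive conversations into one multi-turn example."""
    merged = []
    for start in range(0, len(conversations), num_turns):
        group = conversations[start:start + num_turns]
        merged.append([message for conversation in group for message in conversation])
    return merged


single_turn = [
    [{"role": "user", "content": f"question {i}"},
     {"role": "assistant", "content": f"answer {i}"}]
    for i in range(4)
]
for example in extend_conversations(single_turn, num_turns=3):
    print(len(example), "messages")
```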
Continual Pretraining with Decoupled Rates
from mlx_tune import CPTTrainer, CPTConfig
# Embeddings need 10x lower LR to avoid catastrophic forgetting
trainer = CPTTrainer(
model=model, tokenizer=tokenizer,
train_dataset=raw_text_dataset,
args=CPTConfig(
learning_rate=5e-5,
embedding_learning_rate=5e-6, # Decoupled: prevents embedding drift
include_embeddings=True, # Auto-adds embed_tokens + lm_head
),
)
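Decoupled learning rates amount to choosing the step size by parameter name. The sketch below shows that selection with a plain SGD update; it is an illustration under assumed parameter names rather than CPTTrainer's actual logic, and both helper functions are hypothetical.

```python
def learning_rate_for(param_name, base_lr=5e-5, embedding_lr=5e-6,
                      embedding_keys=("embed_tokens", "lm_head")):
    """Pick the smaller rate for embedding-like parameters, the base rate otherwise."""
    if any(key in param_name for key in embedding_keys):
        return embedding_lr
    return base_lr


def sgd_step(params, grads):
    """One plain SGD update where each parameter uses its own learning rate."""
    return {name: value - learning_rate_for(name) * grads[name]
            for name, value in params.items()}


# Assumed parameter names, shown with scalar values for brevity
params = {"model.embed_tokens.weight": 1.0, "model.layers.0.mlp.up_proj.weight": 1.0}
grads = {name: 1.0 for name in params}
print(sgd_step(params, grads))
```

With identical gradients, the embedding moves ten times less than the transformer weight, which is how the smaller rate limits embedding drift.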
Comparison with Alternatives
| Feature | mlx-tune | Direct MLX | Unsloth (Cloud) | llama.cpp Fine-tuning |
|---|---|---|---|---|
| Platform | Apple Silicon | Apple Silicon | NVIDIA CUDA | Any (CPU/GPU) |
| API Style | Unsloth-compatible | Raw MLX | Original | GGUF scripts |
| Code Portability | ✅ Same script on Mac + cloud | ❌ Mac only | ❌ CUDA only | ❌ GGUF only |
| Training Methods | 6+ RL methods | SFT only | 6+ RL methods | LoRA only |
| Multi-Modal | ✅ VLM, TTS, STT, OCR, Embedding | ❌ Text only | ✅ Vision | ❌ Text only |
| MoE Support | ✅ 39+ architectures | Limited | Limited | ❌ |
| Memory Efficiency | ✅ Unified memory | ✅ Unified memory | ✅ Triton optimized | ⚠️ CPU slow |
| Production Scale | Prototype/local | Prototype/local | ✅ Full scale | ❌ |
| Setup Complexity | pip install | Manual MLX | pip install | Complex build |
When to choose what:
- mlx-tune: Local prototyping with cloud migration path, multi-modal needs, Apple Silicon ownership
- Direct MLX: Maximum control, custom kernel development, no API compatibility needed
- Unsloth cloud: Production training at scale, maximum speed, already have GPU budget
- llama.cpp: Inference-only deployment, edge devices, no training needed
FAQ: Developer Questions Answered
Does mlx-tune replace Unsloth?
No. It's a compatibility bridge, not a competitor. Use mlx-tune for local Mac development, then run the identical script on cloud GPUs with original Unsloth. The creator explicitly states this is for workflow portability, not performance claims.
Can I fine-tune 70B models on my MacBook Air?
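Not on a MacBook Air, which does not ship with enough memory. With 4-bit quantization and QLoRA, the weights of a 70B model take ~35GB. Allowing roughly 8GB for adapters, activations, and overhead, you need about 48GB of unified memory. That rules out a 36GB M3 Max but fits the 48GB and larger Max and Ultra configurations. For 8GB Macs, stick to 1B-3B models.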
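The ~35GB weight figure follows from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below reproduces it and adds a fit check. The 8GB overhead is the rough allowance used in this answer, not a measured value, and real 4-bit formats also store scales, so actual files run slightly larger.

```python
def quantized_weight_gb(num_params, bits_per_weight=4):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(num_params, unified_memory_gb, overhead_gb=8, bits_per_weight=4):
    """True if weights plus an assumed overhead fit in the given unified memory."""
    return quantized_weight_gb(num_params, bits_per_weight) + overhead_gb <= unified_memory_gb


for memory_gb in (36, 48, 128):
    verdict = "fits" if fits(70e9, memory_gb) else "does not fit"
    print(f"70B at 4-bit on a {memory_gb}GB Mac: {verdict}")
```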
Why can't I export GGUF from quantized models?
This is an upstream limitation in mlx-lm, not mlx-tune. Workarounds: (1) use non-quantized base model for training, (2) dequantize during export then re-quantize with llama.cpp, or (3) use MLX format directly without GGUF conversion.
Is the Unsloth API really 100% compatible?
For core features: yes. Import paths differ (mlx_tune vs unsloth), but FastLanguageModel, SFTTrainer, DPOTrainer, etc. match signatures and behavior. Edge cases may differ—test your specific workflow before production migration.
How does performance compare to CUDA Unsloth?
MLX on Apple Silicon is competitive for inference, but training throughput typically lags CUDA + Triton. The value proposition is convenience and unified memory capacity, not raw speed. For large models that don't fit in GPU VRAM, Mac unified memory can actually win.
Can I contribute custom model support?
Absolutely. The project welcomes contributions, especially: custom MLX kernels, test coverage, validation on different M-series chips, and batched audio/RL training (currently batch_size=1 for these modalities).
What about DeepSeek-OCR dependency issues?
DeepSeek-OCR's remote code imports LlamaFlashAttention2, which was removed in transformers>=5.0. Install with transformers<5.0 and mlx-lm<0.31, plus the manual dependencies (addict, einops, matplotlib). DeepSeek-OCR-2 is currently incompatible because mlx-vlm>=0.4 requires transformers>=5.0.
Conclusion: Your Mac Just Became a Fine-Tuning Powerhouse
The divide between local development and cloud training has been a tax on developer productivity for too long. mlx-tune demolishes that barrier with an elegant solution: write once, run anywhere, starting with the machine on your desk.
For Apple Silicon users, this isn't just convenience—it's enabling. The ability to prototype LLM fine-tuning on a MacBook Air during a flight, then push the identical code to an A100 cluster for production, changes the economics of AI experimentation. No more $50 "hello world" cloud bills. No more maintaining divergent codebases. No more context switching.
The technical breadth is genuinely impressive: SFT through GRPO, vision-language models, five TTS architectures, seven STT variants, embedding fine-tuning with contrastive loss, OCR with built-in metrics, MoE support for 39+ architectures, and continual pretraining with decoupled learning rates. All wrapped in an API that feels familiar from day one.
Is it perfect? No—GGUF export from quantized models hits upstream limitations, some audio training is batch_size=1, and raw throughput won't beat CUDA for small models that fit in GPU VRAM. But for the workflow it targets, mlx-tune delivers exceptionally.
My take: If you develop on Mac and train on cloud, this should be your default local framework. The time saved from not rewriting code between environments pays for itself within a week.
Ready to stop paying the context-switching tax? ⭐ Star mlx-tune on GitHub, pip install mlx-tune, and run your first local fine-tuning job today. Your future self—and your cloud bill—will thank you.