Stop Rewriting Code! mlx-tune Brings Unsloth to Apple Silicon

What if every time you wanted to test a fine-tuning idea, you had to burn $5 on cloud GPUs before knowing if it even worked? What if your Mac—sitting right there with 36GB, 64GB, even 512GB of unified memory—was basically a paperweight for LLM training?

Here's the dirty secret most developers won't admit: they're trapped in a two-codebase nightmare. Write training scripts for CUDA. Rewrite them for local testing. Debug on cloud instances at 3 AM. Repeat until your AWS bill triggers an existential crisis.

But what if your MacBook could run the exact same code as your cloud GPUs? No rewrites. No context switching. No "works on my machine" except it actually does.

Enter mlx-tune—the bridge that Apple Silicon users have been desperately waiting for. Built by a developer who was sick of the friction, this unofficial but brilliant wrapper around Apple's MLX framework gives you an Unsloth-compatible API that runs natively on M1/M2/M3/M4/M5 chips. Same code. Same workflow. Different silicon.

The implications are massive. Prototype on your MacBook Air during your commute. Push the identical script to an A100 cluster for production training. That's not convenience—that's code portability as a superpower.

Ready to stop paying the cloud tax for every experiment? Let's dive deep into why mlx-tune is about to become your most-used development tool.

What is mlx-tune?

mlx-tune is an open-source fine-tuning framework that wraps Apple's native MLX machine learning framework in an API that mirrors Unsloth—the gold standard for efficient LLM fine-tuning on NVIDIA GPUs.

Created by ARahim3, a developer who found himself switching between a MacBook M4 for daily work and cloud GPUs for training, mlx-tune solves what he calls the "Context Switch" problem. Unsloth relies on Triton kernels, which don't exist on macOS. Rather than maintaining two entirely different codebases, he built a compatibility layer that lets you write FastLanguageModel code once and run it anywhere.

Critical distinction: This is NOT trying to beat Unsloth on speed benchmarks. It's solving a workflow problem, not a performance problem. The goal is seamless portability between local Mac development and cloud GPU production.

The project started as unsloth-mlx but was renamed to avoid confusion with the official Unsloth project. At v0.4.25, it's already a mature ecosystem supporting:

SFT, DPO, ORPO, GRPO, KTO, SimPO training methods
Vision-Language Models (Gemma 4, Qwen3.5, PaliGemma, LLaVA, Pixtral)
Audio models (5 TTS architectures, 7 STT architectures including streaming)
Embedding models with contrastive learning
OCR models with built-in CER/WER evaluation
MoE architectures (39+ models, 128 experts)
Continual pretraining with decoupled learning rates

The repository has gained significant traction on GitHub with thousands of downloads via PyPI, and it's actively maintained with weekly updates pushing new model support.

Key Features That Make mlx-tune Insane

Let's break down what makes this tool genuinely transformative for Mac-based ML engineers:

🔄 True Unsloth API Compatibility

The killer feature. Import FastLanguageModel from mlx_tune instead of unsloth, and your training script runs unchanged. Same parameters. Same methods. Same mental model. This isn't "inspired by" Unsloth—it's a deliberate API clone that prioritizes drop-in replacement over creative rebranding.
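
If you want one script that picks the right import on its own, a small runtime check can do it. This is a sketch under assumptions, not something the mlx-tune docs prescribe: `pick_backend` and `load_fast_language_model` are hypothetical helper names, and the only library details it relies on are the `mlx_tune` and `unsloth` import paths mentioned above.

```python
import importlib
import platform


def pick_backend(system=None, machine=None):
    """Return the module name to import FastLanguageModel from.

    Apple Silicon reports system "Darwin" and machine "arm64";
    anything else is assumed to be a CUDA machine running Unsloth.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "mlx_tune"
    return "unsloth"


def load_fast_language_model(backend=None):
    """Import FastLanguageModel from the selected backend (hypothetical helper)."""
    module = importlib.import_module(backend or pick_backend())
    return module.FastLanguageModel


print(pick_backend("Darwin", "arm64"))  # mlx_tune
print(pick_backend("Linux", "x86_64"))  # unsloth
```

Calling `load_fast_language_model()` then hands back the class from whichever package is installed on the current machine, so the rest of the script never mentions the backend by name.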

🧠 Unified Memory Exploitation

Apple Silicon's shared memory architecture means your RAM is your VRAM. On a Mac Studio with 512GB unified memory, you can load models that would require multiple A100s on CUDA. mlx-tune builds on MLX's unified-memory design to use this advantage: there are no host-to-device copies over PCIe, and the ceiling is total system memory rather than a separate VRAM pool. You can still run out of memory, but the limit sits much higher.

🎯 Multi-Modal Native Support

Most "LLM fine-tuning" tools are actually just text tools. mlx-tune goes far beyond:

Vision: Full VLM fine-tuning via mlx-vlm integration
Audio: TTS (Orpheus, OuteTTS, Spark, Sesame, Qwen3-TTS) and STT (Whisper, Moonshine, Qwen3-ASR, NVIDIA Canary, Voxtral, Parakeet TDT)
Embeddings: BERT, ModernBERT, Qwen3-Embedding, Harrier with InfoNCE loss
OCR: DeepSeek-OCR, GLM-OCR, olmOCR with character-level metrics

🏗️ MoE Architecture Mastery

Mixture of Experts models are notoriously tricky to fine-tune efficiently. mlx-tune auto-detects MoE layers and applies per-expert LoRA via LoRASwitchLinear—supporting 39+ architectures including Arcee Trinity-Nano's staggering 128 experts plus shared expert.

📦 Flexible Export Pipeline

Train locally, deploy anywhere. Save as HuggingFace format, merge LoRA weights into full models, or convert to GGUF for Ollama/llama.cpp inference. The convert() utility handles HF → MLX conversion for LLMs, TTS, and STT models.

⚡ Advanced Training Methods

Beyond basic SFT, you get preference-optimization and RL methods: GRPO for reasoning (DeepSeek R1-style), DPO with a proper log-probability loss, ORPO's combined SFT-plus-preference objective, KTO for binary feedback, and SimPO, which needs no reference model.
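
To make "log-probability loss" concrete, here is the standard DPO objective for a single preference pair, written in plain Python. It is a sketch of the published formula, not mlx-tune's implementation; the function name, the argument names, and the `beta=0.1` default are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # How much more the policy favors "chosen" over "rejected" than the reference does
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: margin is 0, so the loss is log(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

The loss drops below log(2) as soon as the policy assigns relatively more probability to the chosen response than the reference model does, which is the signal the trainer optimizes.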

Real-World Use Cases Where mlx-tune Dominates

1. The Frugal Prototyper

You're a solo developer with a MacBook Pro M3 (36GB). You want to fine-tune a 7B model for a customer support chatbot. Cloud GPU costs would eat your runway before you validate the concept. With mlx-tune, you run 50-step experiments locally, iterate on prompts and data formatting, then scale to cloud for the final 10,000-step training run. Cost savings: 80%+ on experimentation phase.

2. The Multi-Modal Product Team

Your startup needs document OCR with custom formatting recognition. Traditional pipeline: train text model on cloud, train vision model separately, glue together with fragile middleware. With mlx-tune's FastOCRModel and built-in CER/WER metrics, you fine-tune DeepSeek-OCR or GLM-OCR end-to-end on your Mac, evaluating character-level accuracy in real-time.

3. The Voice AI Innovator

Building a personalized TTS assistant? mlx-tune supports five distinct TTS architectures with automatic codec detection. Fine-tune Orpheus-3B for emotional speech, or Spark-TTS for zero-shot voice cloning—all on Apple Silicon. The TTSDataCollator handles sampling rate normalization automatically.

4. The Embedding Specialist

Semantic search for a niche domain (legal, medical, technical)? Standard embeddings fail on jargon. Use FastEmbeddingModel with InfoNCE contrastive loss to fine-tune Qwen3-Embedding or Harrier on your proprietary document pairs. The EmbeddingDataCollator generates in-batch negatives automatically—no manual negative mining required.

5. The MoE Researcher

Experimenting with Mixture of Experts for efficient serving? Arcee Trinity-Nano's 128 experts would be a configuration nightmare on most frameworks. mlx-tune auto-detects the architecture, applies per-expert LoRA, and even supports continual pretraining with decoupled learning rates for embeddings vs. transformer layers.

Step-by-Step Installation & Setup Guide

Getting started takes under five minutes. Here's the complete setup:

Prerequisites

Hardware: Apple Silicon Mac (M1/M2/M3/M4/M5)
OS: macOS 13.0 or later
Memory: 8GB+ unified RAM (16GB+ strongly recommended)
Python: 3.9 or newer

Installation Commands

# RECOMMENDED: Using uv (faster, more reliable dependency resolution)
uv pip install mlx-tune

# With audio support for TTS/STT fine-tuning
uv pip install 'mlx-tune[audio]'
brew install ffmpeg  # Required system dependency for audio codecs

# Alternative: Standard pip
pip install mlx-tune

# Development install from source
git clone https://github.com/ARahim3/mlx-tune.git
cd mlx-tune
uv pip install -e .

Verification

# Quick import test
from mlx_tune import FastLanguageModel, SFTTrainer
print("mlx-tune loaded successfully!")

# Check MLX backend
import mlx.core as mx
print(f"MLX version: {mx.__version__}")
print(f"Default device: {mx.default_device()}")

Environment Optimization

For maximum performance on Apple Silicon:

# Inspect the current CPU power state (read-only; this does not change any setting)
sudo powermetrics --samplers cpu_power -n 1

# Optional environment variables (check the MLX docs for your version before relying on them)
export MLX_CPU_COUNT=8  # Adjust to your core count
export MLX_METAL_DEVICE_WRAPPER_TYPE=1  # Metal debugging only; leave unset for normal training runs

Dependency Conflicts (Important!)

If using DeepSeek-OCR models, note the transformers version constraint:

# DeepSeek-OCR requires transformers < 5.0
uv pip install 'transformers>=4.45,<5.0' 'mlx-lm<0.31' 'mlx-vlm<0.4'
uv pip install mlx-tune --no-deps  # Prevent dependency override
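
Before launching a long run, it can help to confirm that the installed versions actually satisfy those pins. The check below is a minimal sketch: `parse` and `in_range` are hypothetical helpers that compare only the numeric parts of a version string (pre-release tags are ignored), and the bounds are copied from the install command above.

```python
from importlib import metadata


def parse(version):
    """Turn '4.45.2' into (4, 45, 2); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def in_range(version, lower=None, upper=None):
    """True if lower <= version < upper (either bound may be omitted)."""
    v = parse(version)
    if lower is not None and v < parse(lower):
        return False
    if upper is not None and v >= parse(upper):
        return False
    return True


# Bounds copied from the install command above
CONSTRAINTS = {
    "transformers": ("4.45", "5.0"),
    "mlx-lm": (None, "0.31"),
    "mlx-vlm": (None, "0.4"),
}

for name, (lower, upper) in CONSTRAINTS.items():
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    status = "ok" if in_range(installed, lower, upper) else "conflict"
    print(f"{name} {installed}: {status}")
```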

REAL Code Examples from the Repository

Let's examine actual code patterns from mlx-tune's documentation, with detailed explanations of how each works.

Example 1: Basic SFT Fine-Tuning Pipeline

This is the bread-and-butter workflow—notice how identical it is to Unsloth:

from mlx_tune import FastLanguageModel, SFTTrainer, SFTConfig
from datasets import load_dataset

# Load any HuggingFace model, quantized or full precision
# The 4-bit quantization dramatically reduces memory for quick experiments
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mlx-community/Llama-3.2-1B-Instruct-4bit",
    max_seq_length=2048,      # Context window for training
    load_in_4bit=True,        # QLoRA: 4-bit base + 16-bit LoRA adapters
)

# Add LoRA adapters to attention layers
# r=16 gives good quality; increase to 32-64 for complex tasks
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank: controls adapter capacity
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention only
    lora_alpha=16,            # Scaling factor: typically equal to r
)

# Load dataset—using small slice for quick validation
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:100]")

# SFTTrainer mirrors the TRL-style API, so the arguments should look familiar
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        learning_rate=2e-4,   # Standard LoRA LR; 10x lower for full fine-tuning
        max_steps=50,         # Quick test; production: 1000-10000
    ),
)
trainer.train()

# Export options: adapters only, merged model, or GGUF for Ollama
model.save_pretrained("lora_model")           # Smallest: just adapters
model.save_pretrained_merged("merged", tokenizer)  # Full model for HF
model.save_pretrained_gguf("model", tokenizer)     # For llama.cpp/Ollama

Key insight: The load_in_4bit=True flag enables QLoRA, where the frozen base model stays in 4-bit while only the small LoRA adapters train in 16-bit. A 70B model's 4-bit weights take roughly 35GB, which is what puts 70B fine-tuning within reach of Macs with 48GB or more of unified memory.

Example 2: Vision Model Fine-Tuning

Fine-tuning VLMs for image understanding tasks:

from mlx_tune import FastVisionModel, UnslothVisionDataCollator, VLMSFTTrainer
from mlx_tune.vlm import VLMSFTConfig

# Load vision-language model—auto-detects processor type
model, processor = FastVisionModel.from_pretrained(
    "mlx-community/Qwen3.5-0.8B-bf16",
)

# Configure which components to train
# Vision layers frozen = faster training, less memory
# Language layers train = adapt to your task
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,    # Set False to freeze image encoder
    finetune_language_layers=True,  # Always train text decoder for task
    r=16, lora_alpha=16,
)

# Required: enable training mode for gradient computation
FastVisionModel.for_training(model)

# UnslothVisionDataCollator handles image preprocessing automatically
trainer = VLMSFTTrainer(
    model=model,
    tokenizer=processor,  # VLM uses processor, not just tokenizer
    data_collator=UnslothVisionDataCollator(model, processor),
    train_dataset=dataset,  # Prepared beforehand; format: {"image": path, "conversations": [...]}
    args=VLMSFTConfig(max_steps=30, learning_rate=2e-4),
)
trainer.train()

Critical detail: The UnslothVisionDataCollator is essential—it handles the complex multi-modal batching that standard data collators can't manage. Without it, image tensors and text tokens won't align properly.

Example 3: TTS Fine-Tuning with Auto-Detection

Text-to-speech fine-tuning that automatically handles model-specific quirks:

from mlx_tune import FastTTSModel, TTSSFTTrainer, TTSSFTConfig, TTSDataCollator
from datasets import load_dataset, Audio

# Auto-detects: model architecture, audio codec, token format, sampling rate
# Works with: Orpheus (SNAC), OuteTTS (DAC), Spark-TTS (BiCodec), etc.
model, tokenizer = FastTTSModel.from_pretrained(
    "mlx-community/orpheus-3b-0.1-ft-bf16"
)

# Standard LoRA configuration
model = FastTTSModel.get_peft_model(model, r=16, lora_alpha=16)

# Dataset must match model's expected sampling rate
# Orpheus uses 24kHz; Spark-TTS uses 16kHz
dataset = load_dataset("MrDragonFox/Elise", split="train[:100]")
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))

# TTSDataCollator handles codec-specific audio tokenization
trainer = TTSSFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=TTSDataCollator(model, tokenizer),  # Auto-configured
    train_dataset=dataset,
    args=TTSSFTConfig(
        output_dir="./tts_output",
        max_steps=60,         # TTS converges faster than LLMs
    ),
)
trainer.train()

Pro tip: The TTSDataCollator is doing heavy lifting behind the scenes—converting waveform to discrete tokens via the model's codec (SNAC for Orpheus, DAC for OuteTTS, BiCodec for Spark). Without this abstraction, you'd need to manually implement each codec's forward transform.

Example 4: Embedding Fine-Tuning with Contrastive Loss

Building domain-specific semantic search:

from mlx_tune import FastEmbeddingModel, EmbeddingSFTTrainer
from mlx_tune import EmbeddingSFTConfig, EmbeddingDataCollator

# Pooling strategy critical for task: mean for similarity, cls for classification
model, tokenizer = FastEmbeddingModel.from_pretrained(
    "mlx-community/all-MiniLM-L6-v2-bf16",
    pooling_strategy="mean",  # Options: "mean", "cls", "last_token"
)
model = FastEmbeddingModel.get_peft_model(model, r=16, lora_alpha=16)

# In-batch negatives: each batch's other positives become negatives
# No manual negative mining needed!
trainer = EmbeddingSFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=EmbeddingDataCollator(model, tokenizer),
    train_dataset=[
        {"anchor": "How do I reset my password?", 
         "positive": "Password reset instructions for your account"},
        {"anchor": "Shipping time to Germany",
         "positive": "International delivery takes 5-7 business days"},
        # ... more pairs
    ],
    args=EmbeddingSFTConfig(
        loss_type="infonce",      # InfoNCE: standard contrastive loss
        temperature=0.05,         # Lower = sharper similarity discrimination
        per_device_train_batch_size=32,  # Larger batches = more negatives
        max_steps=50,
    ),
)
trainer.train()

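# Inference: encode and compute similarity
embeddings = model.encode(["Hello world", "Hi there"])
a, b = embeddings[0], embeddings[1]
# Normalize explicitly: a plain dot product is only cosine similarity for unit-length vectors
similarity = ((a * b).sum() / ((a * a).sum() ** 0.5 * (b * b).sum() ** 0.5)).item()
print(f"Cosine similarity: {similarity:.3f}")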

Why this matters: Traditional embedding fine-tuning requires hard negative mining—finding similar but wrong examples. InfoNCE with in-batch negatives uses the other positives in each batch as implicit negatives, dramatically simplifying data preparation.
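
The in-batch negative trick is easier to see in code. Below is a dependency-free sketch of InfoNCE over a batch of (anchor, positive) pairs; it illustrates the loss itself rather than mlx-tune's internals, and `normalize` and `info_nce` are hypothetical names. The default temperature matches the 0.05 used in the config above.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: positives[i] is the match for anchors[i], and
    every other positive in the batch acts as a negative."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    losses = []
    for i, a in enumerate(anchors):
        # Similarity of this anchor to every positive in the batch
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        # Cross-entropy with index i as the correct class (stable log-sum-exp)
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Because each row competes against every other row, a batch of 32 pairs gives each anchor 31 negatives for free, which is why larger batches help here.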

Example 5: MoE Fine-Tuning (Zero Configuration)

Mixture of Experts models that configure themselves:

from mlx_tune import FastLanguageModel, SFTTrainer, SFTConfig

# Load MoE model—same API as dense models!
# Qwen3.5-35B-A3B: 35B total params, 3B active per token
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mlx-community/Qwen3.5-35B-A3B-4bit",
    max_seq_length=2048,
    load_in_4bit=True,        # Critical: 35B model needs quantization
)

# Same target_modules—MoE paths resolved automatically
# Prints: "MoE architecture detected — LoRA will target expert layers"
model = FastLanguageModel.get_peft_model(
    model, r=8,               # Lower rank for MoE (more parameters total)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Training proceeds identically to dense models
# Behind the scenes: LoRASwitchLinear wraps each expert

The magic: mlx-tune detects SwitchLinear layers (the stacked expert projections inside each MoE block) and wraps them in LoRASwitchLinear, which keeps a separate LoRA adapter per expert. Without this, you'd need to manually configure 128+ expert adapters for Trinity-Nano.
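
Per-expert adapters also explain the lower rank in the example. Assuming each expert gets its own pair of LoRA matrices, as described above, the trainable parameter count grows linearly with the number of experts. The layer dimensions below are made up for illustration and `lora_params` is a hypothetical helper.

```python
def lora_params(in_dim, out_dim, r, num_experts=1):
    """Trainable LoRA parameters for one projection.

    Each adapter has A with shape (r, in_dim) and B with shape (out_dim, r);
    a per-expert layout repeats that pair once per expert.
    """
    return num_experts * r * (in_dim + out_dim)


# Illustrative dimensions only, not taken from any specific model
dense = lora_params(in_dim=2048, out_dim=768, r=16)
moe = lora_params(in_dim=2048, out_dim=768, r=8, num_experts=128)

print(dense)  # 45056
print(moe)    # 2883584
```

Even at half the rank, the 128-expert layer carries 64 times the adapter parameters of the dense one, so dropping r is a reasonable way to keep memory in check.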

Advanced Usage & Best Practices

Memory Optimization Strategies

# For maximum model size on limited RAM
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mlx-community/Mistral-7B-Instruct-v0.2-4bit",
    max_seq_length=1024,      # Reduce context window
    load_in_4bit=True,
    # Additional: set gradient checkpointing in SFTConfig
)

# In SFTConfig:
# gradient_checkpointing=True  # Trade compute for memory
# per_device_train_batch_size=1  # Minimum viable batch

Response-Only Training Efficiency

Don't waste gradients on prompt tokens:

from mlx_tune import get_chat_template, train_on_responses_only

# Apply template (auto-detects from model name)
tokenizer = get_chat_template(tokenizer, chat_template="auto")

# Only compute loss on assistant responses
# This focuses the training signal on what the model should actually generate
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
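
Conceptually, response-only training comes down to masking labels. The sketch below shows the idea on raw token IDs; it is not mlx-tune's code. The `-100` ignore value follows the Hugging Face convention and is an assumption here, and the single-token markers stand in for the tokenized `instruction_part` and `response_part` strings.

```python
IGNORE_INDEX = -100  # Hugging Face-style "skip this token" label; assumed here


def find_sub(seq, sub, start=0):
    """Index of the first occurrence of sub in seq at or after start, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_prompt_tokens(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        start += len(response_ids)  # first token of the assistant reply
        end = find_sub(input_ids, instruction_ids, start)
        if end == -1:
            end = len(input_ids)  # reply runs to the end of the sequence
        labels[start:end] = input_ids[start:end]
        pos = end
    return labels


# Toy sequence: 1 = user marker, 2 = assistant marker, other numbers = content
ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
print(mask_prompt_tokens(ids, [1], [2]))
# [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]
```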

Multi-Turn Conversation Extension

Merge multiple short conversations into training examples:

from mlx_tune import to_sharegpt, conversation_extension

# Assumes `dataset` is already in ShareGPT format (convert with to_sharegpt first if not)
# Packing several exchanges per example trains the model on longer multi-turn contexts
extended = conversation_extension(
    dataset, 
    num_turns=3,  # Pack 3 exchanges per example
)
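
To picture what packing does to the data, here is a simplified stand-in written from scratch. It merges consecutive single-exchange conversations in order, which may differ from how the library picks rows to combine; `extend_conversations` is a hypothetical name and the ShareGPT-style `from`/`value` keys are assumed.

```python
def extend_conversations(conversations, num_turns=3):
    """Concatenate consecutive conversations into multi-turn examples."""
    extended = []
    for i in range(0, len(conversations), num_turns):
        merged = []
        for convo in conversations[i:i + num_turns]:
            merged.extend(convo)
        extended.append(merged)
    return extended


# Four single-exchange conversations in ShareGPT-style format
single_turns = [
    [{"from": "human", "value": f"question {n}"},
     {"from": "gpt", "value": f"answer {n}"}]
    for n in range(4)
]

packed = extend_conversations(single_turns, num_turns=3)
print([len(example) for example in packed])  # [6, 2]
```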

Continual Pretraining with Decoupled Rates

from mlx_tune import CPTTrainer, CPTConfig

# Embeddings need 10x lower LR to avoid catastrophic forgetting
trainer = CPTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=raw_text_dataset,
    args=CPTConfig(
        learning_rate=5e-5,
        embedding_learning_rate=5e-6,  # Decoupled: prevents embedding drift
        include_embeddings=True,       # Auto-adds embed_tokens + lm_head
    ),
)
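
The decoupling itself is a simple routing rule: parameters that belong to the embeddings get the smaller rate, everything else gets the main one. The sketch below shows that rule in isolation, not how CPTTrainer implements it. `learning_rate_for` is a hypothetical helper, the parameter names are illustrative, and the two rates are the ones from the config above.

```python
EMBEDDING_KEYS = ("embed_tokens", "lm_head")  # module names from the config comment above


def learning_rate_for(param_name, learning_rate=5e-5, embedding_learning_rate=5e-6):
    """Pick the learning rate for a parameter based on its name."""
    if any(key in param_name for key in EMBEDDING_KEYS):
        return embedding_learning_rate
    return learning_rate


# Illustrative parameter names
for name in ("model.embed_tokens.weight",
             "model.layers.0.self_attn.q_proj.lora_a",
             "lm_head.weight"):
    print(name, learning_rate_for(name))
```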

Comparison with Alternatives

Feature	mlx-tune	Direct MLX	Unsloth (Cloud)	llama.cpp Fine-tuning
Platform	Apple Silicon	Apple Silicon	NVIDIA CUDA	Any (CPU/GPU)
API Style	Unsloth-compatible	Raw MLX	Original	GGUF scripts
Code Portability	✅ Same script on Mac + cloud	❌ Mac only	❌ CUDA only	❌ GGUF only
Training Methods	6+ RL methods	SFT only	6+ RL methods	LoRA only
Multi-Modal	✅ VLM, TTS, STT, OCR, Embedding	❌ Text only	✅ Vision	❌ Text only
MoE Support	✅ 39+ architectures	Limited	Limited	❌
Memory Efficiency	✅ Unified memory	✅ Unified memory	✅ Triton optimized	⚠️ CPU slow
Production Scale	Prototype/local	Prototype/local	✅ Full scale	❌
Setup Complexity	`pip install`	Manual MLX	`pip install`	Complex build

When to choose what:

mlx-tune: Local prototyping with cloud migration path, multi-modal needs, Apple Silicon ownership
Direct MLX: Maximum control, custom kernel development, no API compatibility needed
Unsloth cloud: Production training at scale, maximum speed, already have GPU budget
llama.cpp: Inference-only deployment, edge devices, no training needed

FAQ: Developer Questions Answered

Does mlx-tune replace Unsloth?

No. It's a compatibility bridge, not a competitor. Use mlx-tune for local Mac development, then run the same script on cloud GPUs with original Unsloth, changing only the import. The creator explicitly states this is for workflow portability, not performance claims.

Can I fine-tune 70B models on my MacBook Air?

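Realistically, no: you need a Mac with more memory than a MacBook Air typically ships with. With 4-bit quantization and QLoRA, a 70B model's weights take ~35GB. Allowing for adapters, activations, and overhead, plan for roughly 48GB of unified memory. That rules out a 36GB M3 Max but fits Max and Ultra configurations with 48GB or more. For 8GB Macs, stick to 1B-3B models.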
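
The arithmetic behind those numbers is short enough to script. This is a back-of-the-envelope sketch: it counts only the quantized weights, and the 13GB headroom is an assumption chosen so that a 70B model lands on the ~48GB figure above. Real usage depends on sequence length, batch size, and quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(num_params, bits_per_weight, unified_memory_gb, headroom_gb=13):
    """Rough check: weights plus assumed headroom for adapters, activations, and the OS."""
    return weight_memory_gb(num_params, bits_per_weight) + headroom_gb <= unified_memory_gb


print(weight_memory_gb(70_000_000_000, 4))  # 35.0
print(fits(70_000_000_000, 4, 36))          # False
print(fits(70_000_000_000, 4, 48))          # True
```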

Why can't I export GGUF from quantized models?

This is an upstream limitation in mlx-lm, not mlx-tune. Workarounds: (1) use non-quantized base model for training, (2) dequantize during export then re-quantize with llama.cpp, or (3) use MLX format directly without GGUF conversion.

Is the Unsloth API really 100% compatible?

For core features: yes. Import paths differ (mlx_tune vs unsloth), but FastLanguageModel, SFTTrainer, DPOTrainer, etc. match signatures and behavior. Edge cases may differ—test your specific workflow before production migration.

How does performance compare to CUDA Unsloth?

MLX on Apple Silicon is competitive for inference, but training throughput typically lags CUDA + Triton. The value proposition is convenience and unified memory capacity, not raw speed. For large models that don't fit in GPU VRAM, Mac unified memory can actually win.

Can I contribute custom model support?

Absolutely. The project welcomes contributions, especially: custom MLX kernels, test coverage, validation on different M-series chips, and batched audio/RL training (currently batch_size=1 for these modalities).

What about DeepSeek-OCR dependency issues?

DeepSeek-OCR's remote code imports LlamaFlashAttention2, which was removed in transformers>=5.0. Install with transformers<5.0 and mlx-lm<0.31, plus the manual dependencies (addict, einops, matplotlib). DeepSeek-OCR-2 is currently incompatible because mlx-vlm>=0.4 requires transformers>=5.0.

Conclusion: Your Mac Just Became a Fine-Tuning Powerhouse

The divide between local development and cloud training has been a tax on developer productivity for too long. mlx-tune demolishes that barrier with an elegant solution: write once, run anywhere, starting with the machine on your desk.

For Apple Silicon users, this isn't just convenience—it's enabling. The ability to prototype LLM fine-tuning on a MacBook Air during a flight, then push the identical code to an A100 cluster for production, changes the economics of AI experimentation. No more $50 "hello world" cloud bills. No more maintaining divergent codebases. No more context switching.

The technical breadth is genuinely impressive: SFT through GRPO, vision-language models, five TTS architectures, seven STT variants, embedding fine-tuning with contrastive loss, OCR with built-in metrics, MoE support for 39+ architectures, and continual pretraining with decoupled learning rates. All wrapped in an API that feels familiar from day one.

Is it perfect? No—GGUF export from quantized models hits upstream limitations, some audio training is batch_size=1, and raw throughput won't beat CUDA for small models that fit in GPU VRAM. But for the workflow it targets, mlx-tune delivers exceptionally.

My take: If you develop on Mac and train on cloud, this should be your default local framework. The time saved from not rewriting code between environments pays for itself within a week.

Ready to stop paying the context-switching tax? ⭐ Star mlx-tune on GitHub, pip install mlx-tune, and run your first local fine-tuning job today. Your future self—and your cloud bill—will thank you.