oMLX: The Secret Weapon for Local LLMs on Apple Silicon
Your Mac is more powerful than you think. While developers burn through hundreds of dollars monthly on cloud GPU credits, a quiet revolution is happening on Apple Silicon. The M4 Max in your MacBook Pro? It can run a 70B parameter model locally—if you have the right inference engine. The problem? Every tool you've tried forces a brutal choice: brain-dead simple but cripplingly slow, or blazing fast but requiring a PhD in distributed systems to configure.
What if you didn't have to choose?
Meet oMLX, the LLM inference server that transforms your Mac into a production-grade AI workstation—managed entirely from your menu bar. Built by a developer who was fed up with existing solutions, oMLX combines continuous batching, tiered KV caching across RAM and SSD, and multi-model serving into a package so polished it feels like Apple itself built it. No terminal gymnastics. No Docker containers eating your RAM. Just download, drag to Applications, and start running models that would cost you $2/hour on the cloud.
The secret sauce? oMLX persists your KV cache across a hot in-memory tier and cold SSD tier—even when conversations change context mid-stream. Past context stays cached and reusable across requests. For developers using Claude Code and similar tools, this isn't just convenient. It's transformative.
What is oMLX?
oMLX is an open-source LLM inference server architected specifically for Apple Silicon, created by Jun Kim and hosted at github.com/jundot/omlx. Born from a simple frustration—every existing server demanded trade-offs between convenience and control—oMLX represents a fundamentally different approach to local AI infrastructure.
The project's philosophy is radical in its simplicity: your Mac should be the best place to run AI, not a compromise you settle for. This means native macOS integration through a PyObjC menu bar app (no Electron bloat), intelligent memory management that prevents system-wide OOM crashes, and performance optimizations that squeeze every teraflop from MLX, Apple's machine learning framework.
What makes oMLX genuinely exciting isn't just its feature list—it's the architectural coherence. Where competitors bolt features onto generic backends, oMLX was designed from the ground up around Apple Silicon's unified memory architecture. The tiered KV cache system, for instance, exploits the fact that Mac SSDs (especially on Pro/Max chips with enhanced bandwidth) can serve as genuine memory extensions, not just slow swap space.
The project is gaining serious traction among developers who've discovered that local inference isn't just about privacy or cost savings—it's about latency and control. When your model lives on your machine, you eliminate network round-trips, API rate limits, and vendor lock-in. oMLX makes this practical for real development workflows, not just weekend experiments.
Key Features That Separate oMLX From the Pack
Tiered KV Cache: RAM Meets SSD in Perfect Harmony
oMLX implements block-based KV cache management inspired by vLLM, but with a crucial innovation: a two-tier system that spans volatile and persistent storage.
- Hot tier (RAM): Frequently accessed cache blocks stay in ultra-fast memory for immediate retrieval
- Cold tier (SSD): When RAM fills, blocks offload to SSD in efficient safetensors format
The magic happens on cache restoration: when a request matches a previously-seen prefix, oMLX pulls blocks from SSD instead of recomputing from scratch. Even after server restart. This means your multi-turn coding sessions, document analyses, and agent workflows maintain context persistence that rivals cloud services.
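To make the hot/cold idea concrete, here is a minimal sketch of a two-tier block store. It is not oMLX's implementation: the class name `TieredBlockCache`, the raw-bytes file format (oMLX is described as using safetensors), and the block-count capacity are all assumptions chosen to keep the example short.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Optional


class TieredBlockCache:
    """Toy two-tier cache: an LRU dict in RAM that spills to files on disk."""

    def __init__(self, cold_dir: Path, hot_capacity: int) -> None:
        self.cold_dir = cold_dir
        self.hot_capacity = hot_capacity      # max blocks kept in RAM (assumed unit)
        self.hot = OrderedDict()              # key -> bytes, least recently used first
        cold_dir.mkdir(parents=True, exist_ok=True)

    def _cold_path(self, key: str) -> Path:
        return self.cold_dir / f"{key}.bin"

    def put(self, key: str, block: bytes) -> None:
        self.hot[key] = block
        self.hot.move_to_end(key)             # mark as most recently used
        while len(self.hot) > self.hot_capacity:
            old_key, old_block = self.hot.popitem(last=False)
            self._cold_path(old_key).write_bytes(old_block)  # spill to the cold tier

    def get(self, key: str) -> Optional[bytes]:
        if key in self.hot:                   # hot hit
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._cold_path(key)
        if path.exists():                     # cold hit: promote back into RAM
            block = path.read_bytes()
            self.put(key, block)
            return block
        return None                           # miss: the caller must recompute


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = TieredBlockCache(Path(tmp), hot_capacity=2)
        for name in ("a", "b", "c"):
            cache.put(name, name.encode())
        print("hot tier:", list(cache.hot))
        print("restored from disk:", cache.get("a"))
```

Because the cold tier lives in ordinary files, a fresh `TieredBlockCache` pointed at the same directory can still serve earlier blocks, which is the property that lets a cache survive a server restart.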
Continuous Batching Throughput
Using mlx-lm's BatchGenerator, oMLX handles concurrent requests without the naive queue-and-wait approach of simpler servers. Multiple clients can hammer your local API simultaneously, with intelligent scheduling that maximizes GPU utilization. The default concurrency limit of 8 is configurable based on your model sizes and memory constraints.
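Why does continuous batching beat queue-and-wait? The toy simulation below counts decode steps for both strategies. It is a sketch under simplifying assumptions (every request needs at least one token, one step produces one token per active request, prefill is ignored) and does not use mlx-lm's BatchGenerator at all; both function names are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths: list, slots: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: list, slots: int) -> int:
    """A finished request frees its slot for the next queued request immediately."""
    queue = deque(lengths)
    active = []                      # tokens still to generate, per running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())      # admit waiting requests into free slots
        steps += 1                              # one decode step for the whole batch
        active = [n - 1 for n in active if n > 1]  # drop requests that just finished
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]           # tokens each request wants to generate
    print("static:", static_batching_steps(lengths, slots=2))
    print("continuous:", continuous_batching_steps(lengths, slots=2))
```

With one long request and several short ones, the short requests slip through the second slot while the long one is still generating, so the total step count drops.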
Multi-Model Serving with Intelligent Eviction
Load LLMs, VLMs, embedding models, and rerankers simultaneously in one server instance. oMLX manages this complexity through:
- LRU eviction: Automatically unloads least-recently-used models when memory pressure hits
- Manual controls: Interactive load/unload badges in the admin panel
- Model pinning: Keep essential models permanently resident
- Per-model TTL: Auto-unload after configurable idle periods
- Process memory enforcement: Hard ceiling at system RAM minus 8GB (configurable) prevents macOS swap death spirals
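The eviction rules above can be sketched in a few lines. This is an illustration only, assuming a single memory budget in GB and hypothetical model names; `ModelPool` is not an oMLX class, and TTL handling is left out.

```python
from collections import OrderedDict


class ModelPool:
    """Toy LRU pool: evict least recently used, unpinned models to stay in budget."""

    def __init__(self, budget_gb: float) -> None:
        self.budget_gb = budget_gb
        self.loaded = OrderedDict()   # name -> size in GB, least recently used first
        self.pinned = set()

    def used_gb(self) -> float:
        return sum(self.loaded.values())

    def load(self, name: str, size_gb: float, pin: bool = False) -> list:
        """Load (or touch) a model and return the names evicted to make room."""
        evicted = []
        if name in self.loaded:
            self.loaded.move_to_end(name)          # touch: now most recently used
        else:
            while self.used_gb() + size_gb > self.budget_gb:
                victim = next((m for m in self.loaded if m not in self.pinned), None)
                if victim is None:                 # only pinned models remain
                    raise MemoryError(f"cannot fit {name} within {self.budget_gb} GB")
                del self.loaded[victim]
                evicted.append(victim)
            self.loaded[name] = size_gb
        if pin:
            self.pinned.add(name)
        return evicted


if __name__ == "__main__":
    pool = ModelPool(budget_gb=40)
    pool.load("embedder", 2, pin=True)   # hypothetical model names and sizes
    pool.load("coder", 20)
    pool.load("vision", 15)
    pool.load("coder", 20)               # touching "coder" makes "vision" the LRU
    print("evicted:", pool.load("big-chat", 18))
```

Pinning simply removes a model from the list of eviction candidates, which is why a small embedding model can stay resident while larger chat models rotate in and out.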
Native macOS Integration
The PyObjC menu bar app delivers a genuinely Mac-native experience: start/stop server, monitor stats, auto-restart on crash, one-click updates. No terminal windows cluttering your workspace. No browser tabs to remember. Your AI infrastructure becomes as unobtrusive as WiFi control.
Drop-In API Compatibility
Replace OpenAI or Anthropic API calls by changing one URL. oMLX supports:
- Streaming chat completions with usage stats
- Anthropic Messages API with adaptive thinking
- Vision inputs (base64, URL, file paths)
- Embeddings and reranking endpoints
- Tool calling with structured output
Real-World Use Cases Where oMLX Dominates
1. Claude Code at Lightning Speed
The creator's original motivation: running smaller context models with Claude Code without hitting auto-compact at wrong moments. oMLX's context scaling adjusts reported token counts so compaction triggers optimally, while SSE keep-alive prevents read timeouts during long prefill operations. The result? A local coding assistant that feels as responsive as cloud APIs, with your entire codebase context persistent across hours of work.
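One plausible reading of "context scaling" is a simple proportional rescale, sketched below. This is a guess at the idea, not oMLX's actual formula: the function name, the linear scaling, and the window sizes in the example are all assumptions.

```python
def scaled_token_count(actual_tokens: int, model_window: int, client_window: int) -> int:
    """Report usage as if the model had the window size the client assumes.

    If the client compacts at some fraction of `client_window`, scaling makes
    that trigger fire when the same fraction of the real `model_window` is used.
    """
    return round(actual_tokens * client_window / model_window)


if __name__ == "__main__":
    # Hypothetical numbers: a 32k-token local model behind a client that assumes 200k.
    print(scaled_token_count(16_000, model_window=32_000, client_window=200_000))
```

Under this assumption, a conversation that has filled half of the local model's window is reported as having filled half of the larger window, so the client's own compaction threshold lines up with the model's real limit.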
2. Multi-Modal Development Workflows
Vision-language models aren't demos—they're production tools. oMLX runs Qwen3.5 Series, GLM-4V, Pixtral with the same continuous batching and tiered caching as text models. Feed screenshots of UI bugs, architecture diagrams, or handwritten notes directly into your local VLM. OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR) auto-detect with optimized prompts for document processing pipelines.
3. Private RAG Without the Cloud Bill
Build retrieval-augmented generation systems entirely on-device:
- Embedding models: BERT, BGE-M3, ModernBERT for document vectorization
- Rerankers: ModernBERT, XLM-RoBERTa for result refinement
- LLM: Your choice of mlx-lm compatible model for synthesis
Process sensitive documents—legal contracts, medical records, proprietary code—without ever sending tokens to third-party APIs. The tiered cache means repeated queries against the same document base stay blazing fast.
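The shape of such a pipeline fits in a short sketch. To stay runnable without a model, `embed` below is a bag-of-words stand-in rather than a real embedding model, and the rerank step is omitted; in practice you would swap in calls to your local embedding and reranking endpoints. All function names are illustrative.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (NOT a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k passages most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, passages: list) -> str:
    """Put retrieved context first so repeated questions share a prompt prefix."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "the lease term is five years",
        "the tenant pays utilities",
        "bananas are yellow",
    ]
    question = "how long is the lease term"
    print(build_prompt(question, retrieve(question, docs, k=1)))
```

The prompt string is what you would send to the local chat endpoint for synthesis; nothing in the pipeline needs to leave the machine.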
4. Agent Swarms and MCP Integration
With Model Context Protocol (MCP) support, oMLX becomes the inference backbone for sophisticated agent systems. Install via:
/opt/homebrew/opt/omlx/libexec/bin/pip install mcp
Then configure tool providers in mcp.json and let your agents orchestrate multiple capabilities through a single, locally-hosted API.
Step-by-Step Installation & Setup Guide
Option 1: macOS App (Recommended for Most Users)
The zero-friction path:
- Download the .dmg from GitHub Releases
- Drag oMLX to Applications
- Launch—Welcome screen guides model directory, server start, first download
- Auto-update handles future versions
Note: The app doesn't install CLI commands. Use Homebrew or source install for terminal access.
Option 2: Homebrew (Power User Friendly)
# Add tap and install
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
# Keep current
brew update && brew upgrade omlx
# Run as persistent background service
brew services start omlx # Auto-restarts on crash
Service defaults: ~/.omlx/models directory, port 8000. Customize via environment variables or run once with flags to persist to ~/.omlx/settings.json.
Log locations:
- Service stdout/stderr: $(brew --prefix)/var/log/omlx.log
- Structured app logs: ~/.omlx/logs/server.log
Option 3: From Source (Developers & Contributors)
# Clone repository
git clone https://github.com/jundot/omlx.git
cd omlx
# Core installation
pip install -e .
# With MCP support
pip install -e ".[mcp]"
# Development dependencies
pip install -e ".[dev]"
pytest -m "not slow" # Run test suite
System requirements: macOS 15.0+ (Sequoia), Python 3.10+, Apple Silicon (M1/M2/M3/M4).
Post-Install: First Model Setup
Create your model directory and populate with MLX-format models:
mkdir -p ~/.omlx/models
# Download from HuggingFace, or use built-in admin dashboard downloader
Supported organization:
~/.omlx/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
├── bge-m3/
└── mlx-community/
└── some-model/
REAL Code Examples from the Repository
Example 1: Basic Server Launch with Custom Memory Limits
The foundation of oMLX operation—start serving with production-grade resource controls:
# Start with explicit model directory and memory constraints
omlx serve \
--model-dir ~/models \
--max-model-memory 32GB \
--max-process-memory 80% \
--max-concurrent-requests 16
What's happening here:
- --model-dir points to your MLX model collection
- --max-model-memory caps individual model footprint (prevents one giant model from monopolizing memory)
- --max-process-memory sets an absolute ceiling at 80% of system RAM, a critical safety valve on macOS where swap performance degrades catastrophically
- --max-concurrent-requests tunes the throughput vs. latency tradeoff; 16 is double the default of 8, which suits smaller models, while 70B+ models may need a lower value to stay within memory
Example 2: Enabling the Full Tiered Cache Stack
Unlock oMLX's signature feature—persistent KV cache across RAM and SSD:
omlx serve \
--model-dir ~/models \
--paged-ssd-cache-dir ~/.omlx/cache \
--hot-cache-max-size 20%
Deep dive: This configuration activates both cache tiers. The --paged-ssd-cache-dir enables cold tier persistence in safetensors format—efficient, recoverable, and format-compatible with the broader ML ecosystem. The --hot-cache-max-size 20% reserves one-fifth of system RAM for the hot tier; tune this based on your typical context lengths and model count.
Critical insight: The cold tier isn't mere swap. When a new request shares prefix tokens with cached conversation history, oMLX performs selective restoration—only needed blocks come back from SSD, not entire sequences. This is the architectural difference that makes multi-turn conversations practical.
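Prefix matching at block granularity is commonly done with chained hashes, in the style of vLLM's prefix caching. The sketch below shows that general technique; the block size, hash choice, and function names are assumptions, not details taken from oMLX.

```python
import hashlib


def block_hashes(tokens: list, block_size: int = 4) -> list:
    """Hash each full block together with the hash of everything before it."""
    hashes, parent = [], ""
    full = len(tokens) - len(tokens) % block_size   # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        digest = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(digest)
        parent = digest                             # chain: a block's identity includes its prefix
    return hashes


def reusable_blocks(cached: set, tokens: list, block_size: int = 4) -> int:
    """Count the leading blocks of `tokens` whose hashes are already cached."""
    count = 0
    for h in block_hashes(tokens, block_size):
        if h not in cached:
            break
        count += 1
    return count


if __name__ == "__main__":
    first_turn = list(range(10))                    # toy token IDs
    cached = set(block_hashes(first_turn))
    second_turn = list(range(8)) + [99, 100, 101, 102]
    print("blocks to restore:", reusable_blocks(cached, second_turn))
```

Because each hash depends on all earlier blocks, a match on block N guarantees the whole prefix up to N is identical, so only the matching leading blocks need to be fetched and everything after the first mismatch is recomputed.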
Example 3: Homebrew Service with Environment Customization
For the "set it and forget it" background operation:
# Configure via environment before service start
export OMLX_MODEL_DIR=/Volumes/External/models
export OMLX_PORT=8080
export OMLX_API_KEY=your-production-key
# Persist and launch
brew services start omlx
# Verify operation
brew services info omlx
Why this pattern matters: Environment variables inject configuration without modifying plist files directly. The service wrapper captures these at launch, writes to ~/.omlx/settings.json, and subsequent brew services restart operations maintain your preferences. For API key authentication, combine --api-key with admin panel localhost verification bypass for secure local-only operation.
Example 4: MCP-Enabled Server with Tool Configuration
For agent and tool-use scenarios:
# Install MCP support (Homebrew path shown)
/opt/homebrew/opt/omlx/libexec/bin/pip install mcp
# Launch with tool configuration
omlx serve \
--model-dir ~/models \
--mcp-config mcp.json \
--max-concurrent-requests 8
The mcp.json structure defines available tools—file system access, database queries, web search— that your models can invoke through structured function calling. oMLX's tool parser auto-detects formats across major model families (Llama, Qwen, DeepSeek, Gemma, Mistral, and more), routing XML or JSON tool markup appropriately.
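To give a feel for what a tool parser does, here is a minimal sketch that handles one markup style: a JSON object wrapped in `<tool_call>` tags, as some chat templates emit. It is not oMLX's parser, which covers more formats; the regex and the output shape are assumptions for illustration.

```python
import json
import re

# Matches a JSON object wrapped in <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return [{'name': ..., 'arguments': {...}}, ...] found in model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue                       # skip malformed markup instead of failing
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


if __name__ == "__main__":
    output = (
        "Let me check the file.\n"
        '<tool_call>\n{"name": "read_file", "arguments": {"path": "a.txt"}}\n</tool_call>'
    )
    print(extract_tool_calls(output))
```

A server then maps each extracted call onto the structured `tool_calls` field of the API response, so clients never see the raw markup.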
Example 5: Python Client Integration Pattern
Connect any OpenAI-compatible client to your local server:
from openai import OpenAI

# Point to local oMLX instance
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local",  # Or your configured key
)

# Streaming chat completion with automatic prefix caching
response = client.chat.completions.create(
    model="Step-3.5-Flash-8bit",  # Or your configured alias
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain this function's time complexity..."},
    ],
    stream=True,
    stream_options={"include_usage": True},  # oMLX supports this!
)

for chunk in response:
    # With include_usage, OpenAI-style servers send a final usage-only chunk
    # whose choices list is empty, so guard before indexing into it.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:
        print(f"\n[Tokens: {chunk.usage.total_tokens}]")
Key compatibility note: The stream_options.include_usage flag—often unsupported in local servers—works fully in oMLX, enabling accurate cost tracking and context management in client applications.
Advanced Usage & Best Practices
Memory Tuning for Your Specific Mac
Apple Silicon unified memory means no CPU/GPU copy overhead, but also no dedicated VRAM protection. Follow this hierarchy:
- Measure baseline: Run omlx benchmark from admin panel with your target models
- Set process ceiling: --max-process-memory at RAM minus 8-12GB for macOS overhead
- Model-specific limits: --max-model-memory prevents one model from evicting everything else
- Monitor swap: If memory_pressure shows yellow/red, reduce hot cache or model count
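A quick back-of-the-envelope budget helps before touching any flags. The helper below is a sketch with assumed accounting: it treats the hot cache as a fraction of total RAM and counts it, together with model weights, against the process ceiling. How oMLX actually attributes memory may differ.

```python
def memory_budget(total_gb: float, os_reserve_gb: float,
                  hot_cache_fraction: float, model_sizes_gb: list) -> dict:
    """Estimate how much headroom remains under the process ceiling."""
    process_ceiling = total_gb - os_reserve_gb     # e.g. RAM minus 8-12GB for macOS
    hot_cache = total_gb * hot_cache_fraction      # assumed: fraction of total RAM
    models = sum(model_sizes_gb)
    return {
        "process_ceiling_gb": process_ceiling,
        "hot_cache_gb": hot_cache,
        "models_gb": models,
        "headroom_gb": process_ceiling - hot_cache - models,
    }


if __name__ == "__main__":
    # Hypothetical 64GB Mac running a 20GB chat model and a 2GB embedding model.
    for key, value in memory_budget(64, 8, 0.20, [20, 2]).items():
        print(f"{key}: {value:.1f}")
```

A negative headroom figure is the signal to shrink the hot cache, drop a model, or pick a smaller quantization before the system starts swapping.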
Context Length Optimization
Long contexts are where oMLX's tiered cache shines, but require tuning:
- Hot cache size: Increase --hot-cache-max-size for repetitive long-document queries
- SSD quality: NVMe SSDs (built-in on all modern Macs) handle cold tier well; external USB drives degrade performance
- Prefix patterns: Structure prompts to maximize shared prefixes: system prompts first, then document context, then varying queries
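The prefix-pattern advice is easy to check on paper. The sketch below compares two prompt layouts and measures how much of the rendered text two requests share; character counts stand in for tokens here, and the function names and sample text are purely illustrative.

```python
import os


def document_first(system_prompt: str, document: str, query: str) -> str:
    """Stable content first, varying query last: long shared prefix."""
    return f"{system_prompt}\n\n{document}\n\nQuestion: {query}"


def query_first(system_prompt: str, document: str, query: str) -> str:
    """Same content with the varying query up front: almost nothing is shared."""
    return f"Question: {query}\n\n{system_prompt}\n\n{document}"


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading text (a rough proxy for shared tokens)."""
    return len(os.path.commonprefix([a, b]))


if __name__ == "__main__":
    system = "You answer questions about the contract."
    doc = "Clause 1: The lease term is five years. Clause 2: The tenant pays utilities."
    q1, q2 = "What is the lease term?", "Who pays utilities?"
    print("document first:", shared_prefix_chars(document_first(system, doc, q1),
                                                 document_first(system, doc, q2)))
    print("query first:", shared_prefix_chars(query_first(system, doc, q1),
                                              query_first(system, doc, q2)))
```

Only the document-first layout lets a prefix cache reuse the work spent on the system prompt and document, since a prefix cache can only match from the start of the sequence.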
Production Deployment Patterns
For team or CI/CD usage:
- API key + localhost bypass: Secure multi-user access without network exposure
- Homebrew service + launchd: Survives reboots, auto-recovers from crashes
- External model directory: Symlink or mount ~/models from network storage for shared model libraries
Comparison with Alternatives
| Feature | oMLX | llama.cpp | Ollama | vLLM (MLX) |
|---|---|---|---|---|
| Native macOS UI | ✅ Menu bar app | ❌ CLI only | ✅ Basic app | ❌ CLI only |
| Tiered KV Cache | ✅ RAM + SSD | ❌ RAM only | ❌ RAM only | ❌ RAM only |
| Continuous Batching | ✅ Built-in | ⚠️ Limited | ❌ Sequential | ✅ Yes |
| Multi-Model Serving | ✅ Simultaneous | ❌ One at a time | ⚠️ Switch only | ⚠️ Complex |
| API Compatibility | ✅ OpenAI + Anthropic | ⚠️ Partial | ✅ OpenAI | ✅ OpenAI |
| VLM Support | ✅ Full pipeline | ⚠️ Variable | ✅ Limited | ❌ No |
| MCP/Tool Calling | ✅ Native | ❌ Manual | ❌ No | ❌ No |
| Auto-Model Download | ✅ Admin dashboard | ❌ Manual | ✅ Pull command | ❌ Manual |
| Memory Safety | ✅ Hard limits | ❌ OOM possible | ⚠️ Swap heavy | ❌ OOM possible |
When to choose oMLX: You want maximum performance on Apple Silicon without sacrificing usability. The tiered cache and multi-model management are genuinely unique—no competitor persists KV state across RAM/SSD boundaries with this polish.
When others make sense: llama.cpp for cross-platform deployment (Linux/Windows), Ollama for absolute beginners on any OS, vLLM for multi-GPU Linux clusters.
FAQ
Q: Does oMLX work on Intel Macs? No. It is Apple Silicon only (M1/M2/M3/M4). oMLX is built on MLX, which targets the Apple Silicon GPU and its unified memory architecture rather than Intel hardware.
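Q: Can I run models larger than my RAM? Only in the sense that quantization (4-bit, 8-bit) shrinks a model whose full-precision weights would not fit. The quantized weights still have to fit under the process memory ceiling: a 70B model at 4-bit needs roughly 35GB for weights alone, which is too tight for a 36GB MacBook Pro once macOS overhead is reserved. The SSD tier offloads KV cache blocks, not weights, so it extends how much conversation context stays reusable rather than how large a model you can load.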
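The ~35GB figure is plain arithmetic: parameters times bits per weight, divided by eight bits per byte. The helper below is a rough sketch that counts weights only; it ignores quantization metadata (scales, zero points), the KV cache, and runtime overhead, so real footprints run somewhat higher.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"70B at {bits}-bit: ~{weight_footprint_gb(70, bits):.0f} GB")
```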
Q: How does this compare to cloud API costs? At typical usage (Claude Code daily, document analysis weekly), break-even against Claude Pro ($20/mo) or GPT-4 API occurs in 2-3 months. For heavy users, savings reach thousands annually.
Q: Is my data really private? Completely. Zero network calls for inference (optional update checks can be disabled). Models run locally; no telemetry in the open-source build.
Q: Can I use oMLX with my existing OpenAI client libraries?
Yes—drop-in replacement. Change base_url to http://localhost:8000/v1 and optionally set api_key. All standard endpoints work.
Q: What about model availability?
Any MLX-format model from HuggingFace works. The admin dashboard includes a one-click downloader for popular models. Two-level paths like mlx-community/model-name/ are supported.
Q: How do I contribute or report issues? See the Contributing Guide. The project welcomes bug fixes, performance optimizations, and documentation improvements.
Conclusion: Your Mac Deserves Better Than Cloud Compromise
oMLX isn't just another local LLM tool—it's a reclamation of computational sovereignty. For too long, developers accepted the narrative that serious AI work requires cloud GPUs, API keys, and recurring bills. The reality? Your Apple Silicon Mac, properly unleashed, rivals mid-range cloud instances for individual and small-team workflows.
The tiered KV cache transforms ephemeral local inference into stateful, persistent AI infrastructure. The menu bar integration proves that power doesn't require complexity. And the relentless focus on macOS-native experience—PyObjC, not Electron; auto-update, not manual pulls; service management, not terminal daemons—shows what happens when a tool is built for a platform rather than ported to it.
I've evaluated dozens of inference servers. oMLX is the first that made me cancel a cloud API subscription without looking back. For coding assistants, document analysis, private RAG, and agent prototyping, it's simply unmatched on Apple Silicon.
Stop renting your intelligence. Own it.
👉 Get oMLX on GitHub — Download the app, brew install omlx, or build from source. Your menu bar is about to become the most powerful AI control panel you've ever used.