Tools & UIs Reference
Tools & UIs Reference
Last verified: April 2026. The tools used to run and interact with local LLMs evolve constantly. Below is the current functional state of top engines and User Interfaces (UIs).
The Core Engines (The Backend)
Engines do the heavy lifting: loading models into memory, determining GPU offloading, and exposing an API (usually OpenAI-compatible) so other apps can chat with them. You rarely interact with the engine directly unless you’re using the terminal.
1. Ollama
- Best for: Most users starting out. The standard starting point.
- Key Features: One-command installs (
ollama pull), MLX hardware acceleration for Macs, native tool calling, and simple setup. - New for 2026: Integrates
ollama launchfor instant sandboxed configuration of coding agents. - Platforms: Mac, Windows, Linux.
Quick Setup
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2
The model name (llama3.2) comes from the Models reference page. Swap in any model listed there.
2. llama.cpp / llama-server
- Best for: Hardware tinkerers, low-resource machines, and those wanting support for the latest quantization formats (like TurboQuant 3-bit).
- Key Features: A C/C++ backend optimized for low-level hardware access. If your hardware is unusual, llama.cpp probably supports it. It powers almost every other GUI behind the scenes.
- Platforms: Everything.
Quick Setup
# Mac
brew install llama.cpp
# Linux: build from source. See https://github.com/ggerganov/llama.cpp
llama-cli -m path/to/model.gguf
3. MLX Server (Apple Silicon Only)
- Best for: Mac users wanting maximum tokens-per-second out of Apple Unified Memory. Faster than llama.cpp on M-series for many models.
- Platforms: macOS with Apple Silicon (M1 and later).
Quick Setup
pip install mlx-lm
mlx_lm.server --model mlx-community/model-name
Replace model-name with any model available at mlx-community on HuggingFace.
4. vLLM
- Best for: Serving models at scale on NVIDIA GPUs. Not intended for local single-user use.
- Key Features: Production inference server with an OpenAI-compatible API. Supports tensor parallelism across multiple GPUs via a single flag. PagedAttention gives it higher throughput than naive serving.
- Platforms: Linux with NVIDIA GPUs.
Quick Setup
pip install vllm
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3
Graphical Interfaces (The Desktop Apps)
If you don’t want to use the terminal, you want a GUI. These apps bundle an engine (usually llama.cpp) inside a chat interface.
1. LM Studio
- Best for: Individuals wanting a polished, out-of-the-box desktop experience.
- Key Features: Visual model downloading via HuggingFace, system specification checks (warns you if a model is too big), and LM Link for secure remote GPU usage.
- Platforms: Mac, Windows, Linux.
- Download: lmstudio.ai
Quick Setup
Download from lmstudio.ai. Open the app, go to the model browser, search for a model, and click Download. No terminal required.
2. Open WebUI
- Best for: Power users, organizations, and developers.
- Key Features: Runs in a browser via Docker. Granular Role-Based Access Control (RBAC), multi-user logins, integrated document RAG, and custom modular “Pipelines” for routing and prompt policies.
- Platforms: Self-hosted (Docker).
- Download: openwebui.com
Quick Setup
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
This connects to a running Ollama instance on the same machine. Open http://localhost:3000 in your browser.
3. Unsloth Studio
- Best for: Users who want to fine-tune models directly from their GUI.
- Key Features: Merges the chat experience with LoRA fine-tuning workflows, letting you adjust model behavior visually.
4. AnythingLLM
- Best for: Users who want to chat with their own documents without setting up a full RAG pipeline manually.
- Key Features: All-in-one desktop app for document Q&A (RAG). Supports multiple LLM backends including Ollama. No Docker required. Point it at a folder of PDFs or text files and start asking questions.
- Download: anythingllm.com
Low-Code Workflow UIs
If you want to orchestrate RAG pipelines or agents visually without writing Python:
Dify
- Best for: Production deployments and teams that want a managed platform.
- How it works: Cloud-hosted option available, also self-hostable. Built-in LLM orchestration, prompt management, and API publishing. Lets you deploy a working app from a visual flow without writing server code.
- Site: dify.ai
Langflow
- Best for: Rapid prototyping on your own machine.
- How it works: Self-hosted only. Python-based drag-and-drop flow builder. Lets you wire together document loaders, vector databases, LLMs, and tool-callers into a pipeline you can test locally. Not designed for production traffic.
- Site: langflow.org
Specialized Assistants
OpenClaw
- Best for: Users who want AI integrated into their operating system as a background personal assistant.
- How it works: Connects to services like WhatsApp and Slack, monitors for triggers, and can execute actions on your behalf. Generates and stores reusable “skills” locally so it doesn’t repeat the same reasoning twice.
Summary Comparison
| Tool | Type | Multi-User | GPU Tier | Target Use |
|---|---|---|---|---|
| Ollama | Backend Engine | No | Consumer | Developers / most users |
| LM Studio | Desktop App | No | Consumer | Solo users / researchers |
| Open WebUI | Web UI | Yes | Consumer | Teams / organizations |
| Unsloth Studio | Web / Desktop | No | Consumer | Fine-tuners |
| AnythingLLM | Desktop App | No | Consumer | Document Q&A |
| vLLM | Backend Engine | Yes | NVIDIA (multi-GPU) | Production inference |
| MLX Server | Backend Engine | No | Apple Silicon | Mac inference |
| llama.cpp | Backend Engine | No | Any | Tinkerers / embedded |
Related Pages
- Module 1: Getting Started: Ollama installation walkthrough.
- Module 4: Chat Interfaces and Document Q&A: Setting up Open WebUI and AnythingLLM for everyday use.
- Module 8: Ecosystem Setup: Connecting engines, UIs, and tools into a working local stack.