Sources
Every module in this curriculum is built on the work of researchers, engineers, bloggers, and communities who publish openly. This page collects the sources we used, organized by module.
Module 1: Get Running
Tools
- Ollama GitHub
- Ollama Install Guide
- LM Studio
- Unsloth Studio Docs
- Open WebUI GitHub and Documentation
Models
- Qwen3.5 GitHub and HuggingFace Collection
- Qwen3.5-9B beats GPT-OSS-120B — VentureBeat
- GPT-OSS announcement — OpenAI and GitHub
- NVIDIA Nemotron 3 — Developer Blog
- Gemma 3n — DeepMind
Hardware and Emerging Techniques
- TurboQuant — Google Research
- LLM in a Flash — Apple ML Research
- Local LLM Hardware Requirements: Mac vs PC 2026 — SitePoint
General Guides
Module 2: Choose Wisely
Training Quality vs Model Size
- Why Data Quality, Not Model Size, Will Decide LLM Performance — Hurix
- How Smaller Phi-4 Beats Giants — AI ML Etc
- How to align open LLMs with DPO & synthetic data — Phil Schmid
Quantization
- AI Model Quantization Guide: Q4_K_M, Q8, GGUF — Local AI Zone
- GGUF quantization guide — Toni Sagrista
- Unsloth Dynamic 2.0 GGUFs — Unsloth Docs
- GGUF vs GPTQ vs AWQ Compared — Local AI Master
Model Naming
Inference Engines
Memory Bandwidth
- Memory Bandwidth: How Does It Boost Tokens per Second — Hardware Corner
- DDR5 Speed, CPU and LLM Inference — DEV Community
Mixture of Experts
- Mixture of Experts Explained — HuggingFace Blog
- Applying MoE in LLM Architectures — NVIDIA Technical Blog
Module 3A: How Models Think
Transformer Architecture
- Attention Is All You Need — Vaswani et al., 2017 — The original transformer paper.
- The Illustrated Transformer — Jay Alammar
- The Annotated Transformer — Harvard NLP
- 3Blue1Brown: But what is a GPT?
Attention Mechanism
- Multi-Head Attention explained — Lilian Weng
- Understanding Attention in Transformers — Sebastian Raschka
Tokenization
- Byte Pair Encoding — HuggingFace NLP Course
- Let’s build the GPT Tokenizer — Andrej Karpathy
- SentencePiece — Kudo & Richardson, 2018
Embeddings
Scaling Laws
Temperature & Sampling
- How to generate text: decoding methods — HuggingFace Blog
- The Curious Case of Neural Text Degeneration — Holtzman et al., 2019
General
Module 3B: How Models Fit
Quantization
KV Cache & Memory
- Understanding the KV Cache — Hugging Face
- PagedAttention — Kwon et al., 2023 — vLLM’s KV cache memory management.
- Multi-head Latent Attention — DeepSeek-V2
TurboQuant
- TurboQuant — Google Research
- TurboQuant — llama.cpp Discussion #20969
- A Simple Explanation of the Key Idea Behind TurboQuant — r/LocalLLaMA
- TurboQuant Research Paper Walkthrough — Darshan Fofadiya
LLM in a Flash
Alternative Architectures
- Mamba: Linear-Time Sequence Modeling — Gu & Dao, 2023
- Gated DeltaNet
- NVIDIA Nemotron 3 Hybrid Architecture — Developer Blog
Inference Optimization
- Speculative Decoding — Leviathan et al., 2023
- Grouped-Query Attention — Ainslie et al., 2023
- Flash Attention 2 — Dao, 2023
Module 4: What Can You Do With This?
Chat Interfaces
Document Q&A / RAG
- AnythingLLM
- PrivateGPT
- Khoj
- Retrieval-Augmented Generation — Lewis et al., 2020
- Building RAG-based LLM Applications — NVIDIA
Coding Agents
Agent Concepts
- ReAct: Synergizing Reasoning and Acting — Yao et al., 2022
- Toolformer — Schick et al., 2023
- Model Context Protocol — Anthropic
Agent Security
- OWASP Top 10 for LLM Applications — Excessive agency, insecure output handling, prompt injection.
- Prompt injection and AI agent risks — Simon Willison
- AI assistant security risks — Trail of Bits
Automation
Module 5: Agents
Foundations
- ReAct: Synergizing Reasoning and Acting — Yao et al., 2022
- Toolformer — Schick et al., 2023
- LLM Powered Autonomous Agents — Lilian Weng
- Building Effective Agents — Anthropic
- A Survey on LLM-based Autonomous Agents — Wang et al., 2023
Agent Frameworks
Coding Agents
Multi-Agent Orchestration
Reliability and Security
- OWASP Top 10 for LLM Applications
- Prompt injection risks — Simon Willison
- AI assistant security — Trail of Bits
No-Code Platforms
Module 6: Build Custom Tools
APIs and SDKs
- Anthropic SDK — Python and TypeScript
- OpenAI SDK — Python and Node.js
- Google Generative AI SDK
- Ollama API Documentation
- Ollama OpenAI Compatibility
Function Calling / Tool Use
MCP (Model Context Protocol)
RAG (Retrieval-Augmented Generation)
- Retrieval-Augmented Generation — Lewis et al., 2020
- Building RAG Pipelines — NVIDIA
- ChromaDB
- Qdrant
- pgvector — PostgreSQL Extension
- Nomic Embed Text — Nomic AI
- Chunking Strategies for RAG — Pinecone
Testing and Evaluation
Module 7: Local + Cloud
Decision Frameworks
- Local vs Cloud LLMs — Practical Guide 2026 — Prem AI
- When to Use Local vs API Models — Simon Willison
Cost Optimization
Model Routing
Pricing References
Module 8: Supercharge Your Setup
Skills and Plugins
- everything-claude-code — GitHub — 100+ skills, 28 agents, 59 slash commands
- Claude Code Skills Documentation — Anthropic
MCP Ecosystem
- MCP Specification
- mcp.so — Community MCP Directory
- Smithery — MCP Marketplace
- Glama MCP Directory
- Anthropic MCP Repository
IDE Configuration
Community
Module 9: Go Further
Image Generation
- FLUX.1 — Black Forest Labs
- Stable Diffusion XL — Stability AI
- ComfyUI — GitHub
- Draw Things — App Store
- Forge — GitHub
- CivitAI — Community Models and LoRAs
Video Generation
Audio and TTS
Music Generation
Benchmarks and Evaluation
- Chatbot Arena — LMSYS
- MMLU — Hendrycks et al., 2020
- HumanEval — Chen et al., 2021
- SWE-bench
- LiveCodeBench
- Open LLM Leaderboard — HuggingFace
Fine-Tuning
- LoRA — Hu et al., 2021
- QLoRA — Dettmers et al., 2023
- Unsloth — Fine-tuning Platform
- Unsloth Studio Documentation
- HuggingFace Instruction Tuning Datasets
Module 10: What’s Next
Compression Research
- TurboQuant — Google Research
- TurboQuant — Zandieh et al., 2025
- QuIP#: 2-bit Quantization of Large Language Models — Cornell/Meta
On-Device Inference
Speculative Decoding
Alternative Architectures
- Mamba: Linear-Time Sequence Modeling — Gu & Dao, 2023
- Mixture of Experts Explained — HuggingFace Blog
- NVIDIA Nemotron 3 Hybrid Architecture — Developer Blog