Introduction
Large Language Models (LLMs) have revolutionized how we interact with AI, but running them typically requires cloud services or expensive hardware. With Docker, however, you can run powerful LLMs locally on your own machine, gaining complete control, privacy, and cost savings.
In this comprehensive guide, we'll walk through setting up a basic LLM using Docker, making it accessible even for developers new to containerization.
Prerequisites
System Requirements
- RAM: Minimum 8GB (16GB+ recommended)
- Storage: At least 20GB free space
- CPU: Modern multi-core processor
- GPU: Optional but recommended for better performance
Software Requirements
- Docker Desktop (latest version)
- Basic command line knowledge
- Text editor (VS Code recommended)
Step 1: Install Docker
For Windows:
```bash
# Download Docker Desktop from the official website
# https://www.docker.com/products/docker-desktop

# After installation, verify:
docker --version
docker-compose --version
```
For macOS:
```bash
# Using Homebrew
brew install --cask docker

# Or download from the official website

# Verify installation
docker --version
```
For Linux (Ubuntu/Debian):
```bash
# Update package index
sudo apt-get update

# Install Docker
sudo apt-get install docker.io docker-compose

# Add your user to the docker group
sudo usermod -aG docker $USER
# Log out and back in for the group change to take effect

# Verify installation
docker --version
```
Step 2: Choose Your LLM
Popular open-source LLMs you can run locally:
- Ollama: Easy-to-use, supports Llama 2, Mistral, CodeLlama
- LM Studio: User-friendly GUI, multiple model support
- LocalAI: OpenAI-compatible API, self-hosted
- llama.cpp: C++ implementation, very fast
Step 3: Setup Ollama with Docker
We'll use Ollama as it's beginner-friendly and well-documented.
Pull Ollama Docker Image
```bash
# Pull the official Ollama image
docker pull ollama/ollama:latest

# Verify the image
docker images | grep ollama
```
Run Ollama Container
```bash
# Basic run command
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# For GPU support (NVIDIA)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
Verify Ollama is Running
```bash
# Check container status
docker ps

# Check the Ollama API
curl http://localhost:11434/api/tags
```
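The `/api/tags` endpoint returns JSON describing the models currently installed in the container. As a minimal sketch (assuming the payload shape `{"models": [{"name": "..."}]}` that the endpoint returns), a small helper can extract the model names from a parsed response:

```python
import json

def installed_models(tags_payload):
    """Extract model names from a parsed /api/tags response.

    Assumes the payload shape {"models": [{"name": "..."}]}; an empty
    or missing list means no models have been pulled yet.
    """
    return [m["name"] for m in tags_payload.get("models", [])]

# Example: a payload shaped like what /api/tags returns
sample = json.loads('{"models": [{"name": "llama2:latest"}, {"name": "mistral:latest"}]}')
print(installed_models(sample))  # ['llama2:latest', 'mistral:latest']
```

Right after the container starts, expect an empty list — models are pulled in the next step.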
Step 4: Download and Run Your First Model
Pull Llama 2 Model (7B)
```bash
# Pull the model inside the container (llama2 defaults to the 7B variant)
docker exec -it ollama ollama pull llama2

# Same model with an explicit tag (faster, less capable than 13B)
docker exec -it ollama ollama pull llama2:7b

# Larger variant (slower, better reasoning, needs ~16GB RAM)
docker exec -it ollama ollama pull llama2:13b
```
Run Interactive Chat
```bash
# Start chatting with the model
docker exec -it ollama ollama run llama2

# Example prompts:
# >>> Hello! Tell me about Docker
# >>> Write a Python function to reverse a string
```
Step 5: Create Docker Compose Setup
For easier management, create a docker-compose.yml file:
```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-llm
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    # Uncomment for GPU support
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

volumes:
  ollama_data:
```
Run with Docker Compose
```bash
# Start services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```
Step 6: Use the LLM via API
Python Example
```python
import requests

def chat_with_llm(prompt):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": "llama2",
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    response.raise_for_status()
    return response.json()["response"]

# Example usage
result = chat_with_llm("Explain Docker in simple terms")
print(result)
```
JavaScript/Node.js Example
```javascript
const axios = require('axios');

async function chatWithLLM(prompt) {
  const response = await axios.post('http://localhost:11434/api/generate', {
    model: 'llama2',
    prompt: prompt,
    stream: false
  });
  return response.data.response;
}

// Example usage
chatWithLLM('What is machine learning?')
  .then(result => console.log(result))
  .catch(error => console.error(error));
```
cURL Example
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
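The examples above set `"stream": false` to get one complete JSON reply. By default the API streams instead: it sends one JSON object per line, each carrying a `"response"` text fragment, ending with an object where `"done"` is `true`. As a minimal sketch of handling that format, here is a helper that reassembles a streamed reply from its line-delimited chunks:

```python
import json

def join_stream(ndjson_lines):
    """Reassemble a streamed /api/generate response.

    Each input line is a JSON object with a "response" text fragment;
    the final object has "done": true. Concatenates the fragments
    into the full reply text.
    """
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Simulated stream, shaped like the API's line-delimited output
stream = [
    '{"response": "Docker is ", "done": false}',
    '{"response": "a container platform.", "done": true}',
]
print(join_stream(stream))  # Docker is a container platform.
```

In a real client you would feed this the lines from `response.iter_lines()` on a streaming `requests.post`; streaming lets you display tokens as they arrive instead of waiting for the whole reply.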
Step 7: Build a Simple Web Interface
Create a simple HTML interface to interact with your LLM:
```html
<!DOCTYPE html>
<html>
<head>
  <title>Local LLM Chat</title>
  <style>
    body { font-family: Arial; max-width: 800px; margin: 50px auto; }
    #chat { height: 400px; border: 1px solid #ccc; padding: 10px; overflow-y: auto; }
    #input { width: 80%; padding: 10px; }
    #send { padding: 10px 20px; }
  </style>
</head>
<body>
  <h1>🤖 Local LLM Chat</h1>
  <div id="chat"></div>
  <input id="input" placeholder="Type your message...">
  <button id="send">Send</button>

  <script>
    const chat = document.getElementById('chat');
    const input = document.getElementById('input');

    document.getElementById('send').onclick = async () => {
      const message = input.value;
      chat.innerHTML += `<p><strong>You:</strong> ${message}</p>`;

      const response = await fetch('http://localhost:11434/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          model: 'llama2',
          prompt: message,
          stream: false
        })
      });

      const data = await response.json();
      chat.innerHTML += `<p><strong>AI:</strong> ${data.response}</p>`;
      input.value = '';
      chat.scrollTop = chat.scrollHeight;
    };
  </script>
</body>
</html>
```

If the browser blocks the request with a CORS error, Ollama's allowed origins can be relaxed via the `OLLAMA_ORIGINS` environment variable on the container.
Available Models and Their Use Cases
| Model | Size | Use Case | RAM Required |
|---|---|---|---|
| Llama 2 7B | 3.8GB | General chat, code | 8GB |
| Llama 2 13B | 7.3GB | Advanced reasoning | 16GB |
| Mistral 7B | 4.1GB | Fast, accurate | 8GB |
| CodeLlama 7B | 3.8GB | Code generation | 8GB |
| Phi-2 | 1.7GB | Lightweight tasks | 4GB |
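The table above can drive a simple pre-flight check: before pulling a model, compare its RAM requirement against what your machine (or your Docker memory limit) provides. A small illustrative sketch, with the tag names chosen here for readability rather than taken from Ollama's registry:

```python
# RAM requirements from the table above (GB); tag names are illustrative
MODEL_RAM_GB = {
    "llama2:7b": 8,
    "llama2:13b": 16,
    "mistral:7b": 8,
    "codellama:7b": 8,
    "phi-2": 4,
}

def models_that_fit(available_ram_gb):
    """Return the models from the table whose RAM requirement fits."""
    return sorted(m for m, need in MODEL_RAM_GB.items() if need <= available_ram_gb)

print(models_that_fit(8))  # ['codellama:7b', 'llama2:7b', 'mistral:7b', 'phi-2']
```

On an 8GB machine everything except the 13B model is on the table; on 4GB, only Phi-2 fits comfortably.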
Performance Optimization Tips
1. Use GPU Acceleration
```bash
# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
2. Allocate More Resources
```bash
# Increase the Docker memory limit:
# Docker Desktop → Settings → Resources → Memory: 8GB+

# For docker run:
docker run -d -m 8g --cpus="4" ollama/ollama
```
3. Use Quantized Models
Quantized models are smaller and faster with minimal accuracy loss:
```bash
# Pull a 4-bit quantized model
docker exec -it ollama ollama pull llama2:7b-q4_0
```
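A rough back-of-envelope calculation shows why quantization helps: weight storage is roughly parameters × bits per weight ÷ 8 bytes, so dropping a 7B model from 16-bit to 4-bit weights shrinks it from about 14GB to about 3.5GB. A sketch of that arithmetic:

```python
def approx_model_size_gb(num_params, bits_per_weight):
    """Rough weight-storage estimate: parameters x bits / 8, in GB.

    Ignores runtime overhead such as the KV cache and activation
    buffers, so real memory usage is somewhat higher.
    """
    return num_params * bits_per_weight / 8 / 1e9

print(round(approx_model_size_gb(7e9, 16), 1))  # 14.0  (fp16)
print(round(approx_model_size_gb(7e9, 4), 1))   # 3.5   (4-bit, q4_0)
```

The ~3.8GB figure in the model table above is consistent with 4-bit weights plus a little overhead.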
Common Issues and Solutions
Issue 1: Out of Memory Error
Solution: Use a smaller model or increase Docker memory allocation.
Issue 2: Slow Response Times
Solution: Enable GPU support or use quantized models.
Issue 3: Container Won't Start
```bash
# Check Docker logs
docker logs ollama

# Remove and recreate the container
docker rm -f ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
Issue 4: API Connection Refused
```bash
# Check if the container is running
docker ps

# Check port mapping
docker port ollama

# Test the API
curl http://localhost:11434/api/tags
```
Advanced: Multi-Model Setup
Run multiple models simultaneously:
```yaml
version: '3.8'

services:
  ollama-chat:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_chat:/root/.ollama

  ollama-code:
    image: ollama/ollama
    ports:
      - "11435:11434"
    volumes:
      - ollama_code:/root/.ollama

volumes:
  ollama_chat:
  ollama_code:
```

Each service keeps its own model store, so pull the models you want into each container separately, e.g. `docker exec -it <container> ollama pull codellama`. (Note that `OLLAMA_MODELS` is not a model list; it sets the path of the models directory, so it isn't useful here.)
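With two services published on different host ports, clients need to pick the right endpoint per task. A minimal routing sketch, assuming the port mapping in the compose file above (the `endpoint_for` helper and task names are hypothetical, for illustration):

```python
# Host ports from the compose file above
ENDPOINTS = {
    "chat": "http://localhost:11434",  # ollama-chat service
    "code": "http://localhost:11435",  # ollama-code service
}

def endpoint_for(task):
    """Map a task type to the base URL of the matching Ollama service."""
    try:
        return ENDPOINTS[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(ENDPOINTS)}")

print(endpoint_for("code"))  # http://localhost:11435
```

A client would then POST to `endpoint_for(task) + "/api/generate"` exactly as in Step 6.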
Security Best Practices
- Don't expose ports publicly: Keep 11434 on localhost only
- Use authentication: Add a reverse proxy with auth (Nginx, Traefik)
- Regular updates: Keep Docker and images updated
- Monitor resources: Use docker stats to track usage
- Data privacy: Your data stays local, never sent to cloud
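To keep port 11434 off the network entirely, bind the published port to the loopback interface so only processes on the same machine can reach the API. A sketch of the relevant change to the compose file from Step 5 (service and volume names match that file):

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      # Bind to loopback so only this machine can reach the API
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```

For remote access, put an authenticating reverse proxy (Nginx, Traefik) in front instead of exposing the port directly.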
Monitoring and Management
Check Resource Usage
```bash
# Real-time stats
docker stats ollama

# Container resource limits
docker inspect ollama | grep -A 10 Resources
```
View Model List
```bash
docker exec -it ollama ollama list
```
Remove Models
```bash
docker exec -it ollama ollama rm llama2:13b
```
Conclusion
Running LLMs locally with Docker gives you complete control, privacy, and cost savings. While cloud-based solutions are convenient, local deployment is ideal for:
- Privacy-sensitive applications
- Offline development
- Cost optimization for high-volume usage
- Learning and experimentation
- Custom model fine-tuning
Need Help Setting Up AI Solutions? 🚀
Our team specializes in AI implementation, Docker deployments, and custom LLM integrations!
Next Steps
- Explore model fine-tuning with your own data
- Build production-ready applications with LLMs
- Implement RAG (Retrieval Augmented Generation)
- Create custom AI agents and workflows
- Learn about model quantization techniques
Useful Resources
- Ollama Official Documentation
- Docker Documentation
- Hugging Face Model Hub
- Community forums and Discord servers