Introduction
Large Language Models (LLMs) have revolutionized how we interact with AI, but running them typically requires cloud services or expensive hardware. With Docker, however, you can run powerful LLMs locally on your own machine, gaining complete control, privacy, and cost savings.
In this comprehensive guide, we'll walk through setting up a basic LLM using Docker, making it accessible even for developers new to containerization.
Prerequisites
System Requirements
- RAM: Minimum 8GB (16GB+ recommended)
- Storage: At least 20GB free space
- CPU: Modern multi-core processor
- GPU: Optional but recommended for better performance
Software Requirements
- Docker Desktop (latest version)
- Basic command line knowledge
- Text editor (VS Code recommended)
Step 1: Install Docker
For Windows:
```bash
# Download Docker Desktop from the official website
# https://www.docker.com/products/docker-desktop

# After installation, verify:
docker --version
docker-compose --version
```
For macOS:
```bash
# Using Homebrew
brew install --cask docker

# Or download from the official website

# Verify installation
docker --version
```
For Linux (Ubuntu/Debian):
```bash
# Update package index
sudo apt-get update

# Install Docker
sudo apt-get install docker.io docker-compose

# Add your user to the docker group
sudo usermod -aG docker $USER
# Log out and back in for the group change to take effect

# Verify installation
docker --version
```
Step 2: Choose Your LLM
Popular open-source LLMs you can run locally:
- Ollama: Easy-to-use, supports Llama 2, Mistral, CodeLlama
- LM Studio: User-friendly GUI, multiple model support
- LocalAI: OpenAI-compatible API, self-hosted
- llama.cpp: C++ implementation, very fast
Step 3: Setup Ollama with Docker
We'll use Ollama as it's beginner-friendly and well-documented.
Pull Ollama Docker Image
```bash
# Pull the official Ollama image
docker pull ollama/ollama:latest

# Verify the image
docker images | grep ollama
```
Run Ollama Container
```bash
# Basic run command
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# For GPU support (NVIDIA)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
Verify Ollama is Running
```bash
# Check container status
docker ps

# Check the Ollama API
curl http://localhost:11434/api/tags
```
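The `/api/tags` endpoint returns JSON describing the models currently installed in the container. As a minimal sketch (assuming the payload shape `{"models": [{"name": "..."}]}` that the endpoint returns), a small helper can extract the model names from a parsed response:

```python
import json

def installed_models(tags_payload):
    """Extract model names from a parsed /api/tags response.

    Assumes the payload shape {"models": [{"name": "..."}]}; an empty
    or missing list means no models have been pulled yet.
    """
    return [m["name"] for m in tags_payload.get("models", [])]

# Example: a payload shaped like what /api/tags returns
sample = json.loads('{"models": [{"name": "llama2:latest"}, {"name": "mistral:latest"}]}')
print(installed_models(sample))  # ['llama2:latest', 'mistral:latest']
```

Right after the container starts, expect an empty list — models are pulled in the next step.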
Step 4: Download and Run Your First Model
Pull Llama 2 Model (7B)
```bash
# Pull the model inside the container (llama2 defaults to the 7B variant)
docker exec -it ollama ollama pull llama2

# Same model with an explicit tag (faster, less capable than 13B)
docker exec -it ollama ollama pull llama2:7b

# Larger variant (slower, better reasoning, needs ~16GB RAM)
docker exec -it ollama ollama pull llama2:13b
```
Run Interactive Chat
```bash
# Start chatting with the model
docker exec -it ollama ollama run llama2

# Example prompts:
# >>> Hello! Tell me about Docker
# >>> Write a Python function to reverse a string
```
Step 5: Create Docker Compose Setup
For easier management, create a docker-compose.yml file:
```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-llm
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    # Uncomment for GPU support
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

volumes:
  ollama_data:
```
Run with Docker Compose
```bash
# Start services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```
Step 6: Use the LLM via API
Python Example
```python
import requests

def chat_with_llm(prompt):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": "llama2",
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    response.raise_for_status()
    return response.json()["response"]

# Example usage
result = chat_with_llm("Explain Docker in simple terms")
print(result)
```
JavaScript/Node.js Example
```javascript
const axios = require('axios');

async function chatWithLLM(prompt) {
  const response = await axios.post('http://localhost:11434/api/generate', {
    model: 'llama2',
    prompt: prompt,
    stream: false
  });
  return response.data.response;
}

// Example usage
chatWithLLM('What is machine learning?')
  .then(result => console.log(result))
  .catch(error => console.error(error));
```
cURL Example
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
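The examples above set `"stream": false` to get one complete JSON reply. By default the API streams instead: it sends one JSON object per line, each carrying a `"response"` text fragment, ending with an object where `"done"` is `true`. As a minimal sketch of handling that format, here is a helper that reassembles a streamed reply from its line-delimited chunks:

```python
import json

def join_stream(ndjson_lines):
    """Reassemble a streamed /api/generate response.

    Each input line is a JSON object with a "response" text fragment;
    the final object has "done": true. Concatenates the fragments
    into the full reply text.
    """
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Simulated stream, shaped like the API's line-delimited output
stream = [
    '{"response": "Docker is ", "done": false}',
    '{"response": "a container platform.", "done": true}',
]
print(join_stream(stream))  # Docker is a container platform.
```

In a real client you would feed this the lines from `response.iter_lines()` on a streaming `requests.post`; streaming lets you display tokens as they arrive instead of waiting for the whole reply.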
Step 7: Build a Simple Web Interface
Create a simple HTML interface to interact with your LLM:
```html
<!DOCTYPE html>
<html>
<head>
  <title>Local LLM Chat</title>
  <style>
    body { font-family: Arial; max-width: 800px; margin: 50px auto; }
    #chat { height: 400px; border: 1px solid #ccc; padding: 10px; overflow-y: auto; }
    #input { width: 80%; padding: 10px; }
    #send { padding: 10px 20px; }
  </style>
</head>
<body>
  <h1>🤖 Local LLM Chat</h1>
  <div id="chat"></div>
  <input id="input" placeholder="Type your message...">
  <button id="send">Send</button>

  <script>
    const chat = document.getElementById('chat');
    const input = document.getElementById('input');

    document.getElementById('send').onclick = async () => {
      const message = input.value;
      chat.innerHTML += `<p><strong>You:</strong> ${message}</p>`;

      const response = await fetch('http://localhost:11434/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          model: 'llama2',
          prompt: message,
          stream: false
        })
      });

      const data = await response.json();
      chat.innerHTML += `<p><strong>AI:</strong> ${data.response}</p>`;
      input.value = '';
      chat.scrollTop = chat.scrollHeight;
    };
  </script>
</body>
</html>
```

If the browser blocks the request with a CORS error, Ollama's allowed origins can be relaxed via the `OLLAMA_ORIGINS` environment variable on the container.
Available Models and Their Use Cases
| Model | Size | Use Case | RAM Required |
|---|---|---|---|
| Llama 2 7B | 3.8GB | General chat, code | 8GB |
| Llama 2 13B | 7.3GB | Advanced reasoning | 16GB |
| Mistral 7B | 4.1GB | Fast, accurate | 8GB |
| CodeLlama 7B | 3.8GB | Code generation | 8GB |
| Phi-2 | 1.7GB | Lightweight tasks | 4GB |
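The table above can drive a simple pre-flight check: before pulling a model, compare its RAM requirement against what your machine (or your Docker memory limit) provides. A small illustrative sketch, with the tag names chosen here for readability rather than taken from Ollama's registry:

```python
# RAM requirements from the table above (GB); tag names are illustrative
MODEL_RAM_GB = {
    "llama2:7b": 8,
    "llama2:13b": 16,
    "mistral:7b": 8,
    "codellama:7b": 8,
    "phi-2": 4,
}

def models_that_fit(available_ram_gb):
    """Return the models from the table whose RAM requirement fits."""
    return sorted(m for m, need in MODEL_RAM_GB.items() if need <= available_ram_gb)

print(models_that_fit(8))  # ['codellama:7b', 'llama2:7b', 'mistral:7b', 'phi-2']
```

On an 8GB machine everything except the 13B model is on the table; on 4GB, only Phi-2 fits comfortably.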
Performance Optimization Tips
1. Use GPU Acceleration
```bash
# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
2. Allocate More Resources
```bash
# Increase the Docker memory limit:
# Docker Desktop → Settings → Resources → Memory: 8GB+

# For docker run:
docker run -d -m 8g --cpus="4" ollama/ollama
```
3. Use Quantized Models
Quantized models are smaller and faster with minimal accuracy loss:
```bash
# Pull a 4-bit quantized model
docker exec -it ollama ollama pull llama2:7b-q4_0
```
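A rough back-of-envelope calculation shows why quantization helps: weight storage is roughly parameters × bits per weight ÷ 8 bytes, so dropping a 7B model from 16-bit to 4-bit weights shrinks it from about 14GB to about 3.5GB. A sketch of that arithmetic:

```python
def approx_model_size_gb(num_params, bits_per_weight):
    """Rough weight-storage estimate: parameters x bits / 8, in GB.

    Ignores runtime overhead such as the KV cache and activation
    buffers, so real memory usage is somewhat higher.
    """
    return num_params * bits_per_weight / 8 / 1e9

print(round(approx_model_size_gb(7e9, 16), 1))  # 14.0  (fp16)
print(round(approx_model_size_gb(7e9, 4), 1))   # 3.5   (4-bit, q4_0)
```

The ~3.8GB figure in the model table above is consistent with 4-bit weights plus a little overhead.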
Common Issues and Solutions
Issue 1: Out of Memory Error
Solution: Use a smaller model or increase Docker memory allocation.
Issue 2: Slow Response Times
Solution: Enable GPU support or use quantized models.
Issue 3: Container Won't Start
```bash
# Check Docker logs
docker logs ollama

# Remove and recreate the container
docker rm -f ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
Issue 4: API Connection Refused
```bash
# Check if the container is running
docker ps

# Check port mapping
docker port ollama

# Test the API
curl http://localhost:11434/api/tags
```
Advanced: Multi-Model Setup
Run multiple models simultaneously:
```yaml
version: '3.8'

services:
  ollama-chat:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_chat:/root/.ollama

  ollama-code:
    image: ollama/ollama
    ports:
      - "11435:11434"
    volumes:
      - ollama_code:/root/.ollama

volumes:
  ollama_chat:
  ollama_code:
```

Each service keeps its own model store, so pull the models you want into each container separately, e.g. `docker exec -it <container> ollama pull codellama`. (Note that `OLLAMA_MODELS` is not a model list; it sets the path of the models directory, so it isn't useful here.)
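With two services published on different host ports, clients need to pick the right endpoint per task. A minimal routing sketch, assuming the port mapping in the compose file above (the `endpoint_for` helper and task names are hypothetical, for illustration):

```python
# Host ports from the compose file above
ENDPOINTS = {
    "chat": "http://localhost:11434",  # ollama-chat service
    "code": "http://localhost:11435",  # ollama-code service
}

def endpoint_for(task):
    """Map a task type to the base URL of the matching Ollama service."""
    try:
        return ENDPOINTS[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(ENDPOINTS)}")

print(endpoint_for("code"))  # http://localhost:11435
```

A client would then POST to `endpoint_for(task) + "/api/generate"` exactly as in Step 6.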
Security Best Practices
- Don't expose ports publicly: Keep 11434 on localhost only
- Use authentication: Add a reverse proxy with auth (Nginx, Traefik)
- Regular updates: Keep Docker and images updated
- Monitor resources: Use docker stats to track usage
- Data privacy: Your data stays local, never sent to cloud
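To keep port 11434 off the network entirely, bind the published port to the loopback interface so only processes on the same machine can reach the API. A sketch of the relevant change to the compose file from Step 5 (service and volume names match that file):

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      # Bind to loopback so only this machine can reach the API
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```

For remote access, put an authenticating reverse proxy (Nginx, Traefik) in front instead of exposing the port directly.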
Monitoring and Management
Check Resource Usage
```bash
# Real-time stats
docker stats ollama

# Container resource limits
docker inspect ollama | grep -A 10 Resources
```
View Model List
```bash
docker exec -it ollama ollama list
```
Remove Models
```bash
docker exec -it ollama ollama rm llama2:13b
```
Conclusion
Running LLMs locally with Docker gives you complete control, privacy, and cost savings. While cloud-based solutions are convenient, local deployment is ideal for:
- Privacy-sensitive applications
- Offline development
- Cost optimization for high-volume usage
- Learning and experimentation
- Custom model fine-tuning
Need Help Setting Up AI Solutions? 🚀
Our team specializes in AI implementation, Docker deployments, and custom LLM integrations!
Next Steps
- Explore model fine-tuning with your own data
- Build production-ready applications with LLMs
- Implement RAG (Retrieval Augmented Generation)
- Create custom AI agents and workflows
- Learn about model quantization techniques
Useful Resources
- Ollama Official Documentation
- Docker Documentation
- Hugging Face Model Hub
- Community forums and Discord servers