How to Setup a Basic LLM Model via Docker on Your Local Machine

February 3, 2026 by Administrator

Introduction

Large Language Models (LLMs) have revolutionized how we interact with AI, but running them has typically meant paying for cloud services or buying expensive hardware. With Docker, you can run capable LLMs locally on your own machine, giving you complete control, privacy, and cost savings.

In this comprehensive guide, we'll walk through setting up a basic LLM using Docker, making it accessible even for developers new to containerization.

Prerequisites

System Requirements

  • RAM: Minimum 8GB (16GB+ recommended)
  • Storage: At least 20GB free space
  • CPU: Modern multi-core processor
  • GPU: Optional but recommended for better performance

Software Requirements

  • Docker Desktop (latest version)
  • Basic command line knowledge
  • Text editor (VS Code recommended)

Step 1: Install Docker

For Windows:

# Download Docker Desktop from official website
# https://www.docker.com/products/docker-desktop

# After installation, verify:
docker --version
docker-compose --version

For macOS:

# Using Homebrew
brew install --cask docker

# Or download from official website
# Verify installation
docker --version

For Linux (Ubuntu/Debian):

# Update package index
sudo apt-get update

# Install Docker
sudo apt-get install docker.io docker-compose

# Add user to docker group (log out and back in for this to take effect)
sudo usermod -aG docker $USER

# Verify installation
docker --version

Step 2: Choose Your LLM

Popular open-source LLMs you can run locally:

  • Ollama: Easy-to-use, supports Llama 2, Mistral, CodeLlama
  • LM Studio: User-friendly GUI, multiple model support
  • LocalAI: OpenAI-compatible API, self-hosted
  • llama.cpp: C++ implementation, very fast

Step 3: Setup Ollama with Docker

We'll use Ollama as it's beginner-friendly and well-documented.

Pull Ollama Docker Image

# Pull the official Ollama image
docker pull ollama/ollama:latest

# Verify the image
docker images | grep ollama

Run Ollama Container

# Basic run command
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# For GPU support (NVIDIA)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Verify Ollama is Running

# Check container status
docker ps

# Check Ollama API
curl http://localhost:11434/api/tags
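
The curl check returns JSON describing the installed models. A small Python helper, a sketch assuming Ollama's documented `/api/tags` response shape of `{"models": [{"name": ...}]}`, can turn that into a plain list of names:

```python
import json
import urllib.request

def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]

def list_installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Query the running Ollama container for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return parse_model_names(json.load(resp))

if __name__ == "__main__":
    print(list_installed_models())
```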

Step 4: Download and Run Your First Model

Pull Llama 2 Model (7B)

# Pull the default model (the bare llama2 tag resolves to the 7B variant)
docker exec -it ollama ollama pull llama2

# Or pin the 7B tag explicitly (smaller, faster)
docker exec -it ollama ollama pull llama2:7b

# For the larger 13B model (slower, more accurate)
docker exec -it ollama ollama pull llama2:13b

Run Interactive Chat

# Start chatting with the model
docker exec -it ollama ollama run llama2

# Example conversation:
# >>> Hello! Tell me about Docker
# >>> Write a Python function to reverse a string

Step 5: Create Docker Compose Setup

For easier management, create a docker-compose.yml file:

version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-llm
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    # Uncomment for GPU support
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

volumes:
  ollama_data:

Run with Docker Compose

# Start services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Step 6: Use the LLM via API

Python Example

import requests

def chat_with_llm(prompt):
    url = "http://localhost:11434/api/generate"
    
    data = {
        "model": "llama2",
        "prompt": prompt,
        "stream": False
    }
    
    response = requests.post(url, json=data)
    return response.json()['response']

# Example usage
result = chat_with_llm("Explain Docker in simple terms")
print(result)
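
The example above waits for the complete answer. Ollama can also stream the reply as it is generated: with `"stream": true` it returns one JSON object per line, each carrying a `response` fragment and a final `done` marker. A hedged sketch of a streaming client, assuming that newline-delimited format:

```python
import json
import requests

def collect_stream(lines):
    """Join the 'response' fragments from Ollama's newline-delimited
    JSON stream into one string, stopping at the final 'done' chunk."""
    parts = []
    for raw in lines:
        if not raw:  # requests.iter_lines() yields empty keep-alive lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def stream_chat(prompt, model="llama2", url="http://localhost:11434/api/generate"):
    """Request a completion with streaming enabled and assemble the fragments."""
    with requests.post(url, json={"model": model, "prompt": prompt, "stream": True},
                       stream=True) as resp:
        resp.raise_for_status()
        return collect_stream(resp.iter_lines())

if __name__ == "__main__":
    print(stream_chat("Explain Docker in one sentence"))
```

Streaming is what makes a chat UI feel responsive: you can print fragments as they arrive instead of joining them at the end.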

JavaScript/Node.js Example

const axios = require('axios');

async function chatWithLLM(prompt) {
    const response = await axios.post('http://localhost:11434/api/generate', {
        model: 'llama2',
        prompt: prompt,
        stream: false
    });
    
    return response.data.response;
}

// Example usage
chatWithLLM('What is machine learning?')
    .then(result => console.log(result))
    .catch(error => console.error(error));

cURL Example

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Step 7: Build a Simple Web Interface

Create a simple HTML interface to interact with your LLM:

<!DOCTYPE html>
<html>
<head>
    <title>Local LLM Chat</title>
    <style>
        body { font-family: Arial; max-width: 800px; margin: 50px auto; }
        #chat { height: 400px; border: 1px solid #ccc; padding: 10px; overflow-y: auto; }
        #input { width: 80%; padding: 10px; }
        #send { padding: 10px 20px; }
    </style>
</head>
<body>
    <h1>🤖 Local LLM Chat</h1>
    <div id="chat"></div>
    <input id="input" placeholder="Type your message...">
    <button id="send">Send</button>
    
    <script>
        const chat = document.getElementById('chat');
        const input = document.getElementById('input');
        
        document.getElementById('send').onclick = async () => {
            const message = input.value;
            chat.innerHTML += `<p><strong>You:</strong> ${message}</p>`;
            
            const response = await fetch('http://localhost:11434/api/generate', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({
                    model: 'llama2',
                    prompt: message,
                    stream: false
                })
            });
            
            const data = await response.json();
            chat.innerHTML += `<p><strong>AI:</strong> ${data.response}</p>`;
            input.value = '';
            chat.scrollTop = chat.scrollHeight;
        };
    </script>
</body>
</html>

Available Models and Their Use Cases

Model        | Size  | Use Case           | RAM Required
-------------|-------|--------------------|-------------
Llama 2 7B   | 3.8GB | General chat, code | 8GB
Llama 2 13B  | 7.3GB | Advanced reasoning | 16GB
Mistral 7B   | 4.1GB | Fast, accurate     | 8GB
CodeLlama 7B | 3.8GB | Code generation    | 8GB
Phi-2        | 1.7GB | Lightweight tasks  | 4GB
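
The table can double as data for a quick helper that suggests which models fit a given RAM budget. A sketch; the sizes and RAM figures come from the table above, while the exact Ollama tags (`llama2:13b`, `mistral`, `codellama:7b`, `phi`) are assumptions for illustration:

```python
# (Ollama tag, download size in GB, minimum RAM in GB) — figures from the table above
MODELS = [
    ("llama2:13b", 7.3, 16),
    ("mistral", 4.1, 8),
    ("llama2:7b", 3.8, 8),
    ("codellama:7b", 3.8, 8),
    ("phi", 1.7, 4),
]

def models_that_fit(ram_gb: float) -> list[str]:
    """Return model tags whose minimum RAM requirement fits the budget,
    largest model first (list is already ordered by size)."""
    return [name for name, _size, min_ram in MODELS if min_ram <= ram_gb]
```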

Performance Optimization Tips

1. Use GPU Acceleration

# Install the NVIDIA Container Toolkit (repo layout changes over time; check NVIDIA's current install docs)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

2. Allocate More Resources

# Increase Docker memory limit
# Docker Desktop → Settings → Resources → Memory: 8GB+

# For docker run, set explicit limits alongside the usual flags:
docker run -d -m 8g --cpus="4" -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

3. Use Quantized Models

Quantized models are smaller and faster with minimal accuracy loss:

# Pull 4-bit quantized model
docker exec -it ollama ollama pull llama2:7b-q4_0
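
The saving is easy to estimate: a weight stored at b bits takes b/8 bytes, so a 7B-parameter model at 4 bits is roughly 3.5 GB of weights, versus about 14 GB at 16-bit precision. A back-of-the-envelope sketch (ignores overhead such as layers kept at higher precision):

```python
def approx_model_bytes(n_params: float, bits_per_weight: int) -> float:
    """Rough weight-storage size: parameters x bits / 8."""
    return n_params * bits_per_weight / 8

# 7B parameters: 16-bit vs 4-bit quantized
fp16_gb = approx_model_bytes(7e9, 16) / 1e9  # ~14 GB
q4_gb = approx_model_bytes(7e9, 4) / 1e9     # ~3.5 GB
```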

Common Issues and Solutions

Issue 1: Out of Memory Error

Solution: Use a smaller model or increase Docker memory allocation.

Issue 2: Slow Response Times

Solution: Enable GPU support or use quantized models.

Issue 3: Container Won't Start

# Check Docker logs
docker logs ollama

# Remove and recreate container
docker rm -f ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Issue 4: API Connection Refused

# Check if container is running
docker ps

# Check port mapping
docker port ollama

# Test API
curl http://localhost:11434/api/tags
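
Right after `docker run`, the API can refuse connections for a few seconds while the container starts, so a one-shot curl may fail even though nothing is wrong. A small retry loop waits for the API to come up; a sketch, where the `_open` parameter exists only to make the loop testable:

```python
import time
import urllib.error
import urllib.request

def wait_for_ollama(base_url="http://localhost:11434", attempts=10, delay=2.0,
                    _open=urllib.request.urlopen):
    """Poll /api/tags until the container answers, retrying on connection
    errors (e.g. while the container is still starting). Returns True on
    success, False if all attempts fail."""
    for _ in range(attempts):
        try:
            with _open(f"{base_url}/api/tags", timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(delay)
    return False
```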

Advanced: Multi-Model Setup

Run multiple models simultaneously:

version: '3.8'

services:
  ollama-chat:
    image: ollama/ollama
    container_name: ollama-chat
    ports:
      - "11434:11434"
    volumes:
      - ollama_chat:/root/.ollama

  ollama-code:
    image: ollama/ollama
    container_name: ollama-code
    ports:
      - "11435:11434"
    volumes:
      - ollama_code:/root/.ollama

volumes:
  ollama_chat:
  ollama_code:

After the containers start, pull the appropriate model into each instance:

docker exec -it ollama-chat ollama pull llama2
docker exec -it ollama-code ollama pull codellama
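
With two instances listening on different host ports, a thin client can route requests by task type. A sketch: the ports follow the compose file above, and the default model names and the injectable `_post` parameter (used here only for testing) are assumptions for illustration:

```python
import requests

# Host ports from the compose setup above
ENDPOINTS = {
    "chat": "http://localhost:11434",
    "code": "http://localhost:11435",
}
DEFAULT_MODELS = {"chat": "llama2", "code": "codellama"}

def route(task: str) -> tuple[str, str]:
    """Pick the instance URL and default model for a task type."""
    if task not in ENDPOINTS:
        raise ValueError(f"unknown task {task!r}")
    return ENDPOINTS[task], DEFAULT_MODELS[task]

def generate(task: str, prompt: str, _post=requests.post) -> str:
    """Send the prompt to the Ollama instance that serves this task type."""
    base, model = route(task)
    resp = _post(f"{base}/api/generate",
                 json={"model": model, "prompt": prompt, "stream": False})
    return resp.json()["response"]
```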

Security Best Practices

  • Don't expose ports publicly: Bind 11434 to localhost only (e.g. -p 127.0.0.1:11434:11434)
  • Use authentication: Add a reverse proxy with auth (Nginx, Traefik)
  • Regular updates: Keep Docker and images updated
  • Monitor resources: Use docker stats to track usage
  • Data privacy: Your data stays local, never sent to cloud

Monitoring and Management

Check Resource Usage

# Real-time stats
docker stats ollama

# Configured memory limit in bytes (0 = unlimited)
docker inspect ollama --format '{{.HostConfig.Memory}}'

View Model List

docker exec -it ollama ollama list

Remove Models

docker exec -it ollama ollama rm llama2:13b

Conclusion

Running LLMs locally with Docker gives you complete control, privacy, and cost savings. While cloud-based solutions are convenient, local deployment is ideal for:

  • Privacy-sensitive applications
  • Offline development
  • Cost optimization for high-volume usage
  • Learning and experimentation
  • Custom model fine-tuning

Need Help Setting Up AI Solutions? 🚀

Our team specializes in AI implementation, Docker deployments, and custom LLM integrations!

Get Expert Help

Next Steps

  • Explore model fine-tuning with your own data
  • Build production-ready applications with LLMs
  • Implement RAG (Retrieval Augmented Generation)
  • Create custom AI agents and workflows
  • Learn about model quantization techniques
