Ollama & Stable Diffusion AI Guide#
1 HP1GPU - Docker (Ollama Model Inventory)#
1.1 llama3.2:latest (2.0 GB)#
- Consuming VM: HP1Docker
- Services: karakeep (Docker), paperless (Docker), n8n (Docker)
1.2 qwen3-vl:8b (6.1 GB)#
- Consuming VM: HP1Docker
- Service: paperless-ai (Docker)
1.3 gemma3:12b (8.1 GB)#
- Consuming VM: HP1GPU
- Service: open-webui (Docker)
1.4 llama3.2-vision:latest (7.8 GB)#
- Consuming VM: HP1Docker
- Service: paperless
- Status: Inactive (commented out in compose)
1.5 qwen2.5vl:3b (3.2 GB)#
- Consuming VM: HP1Docker
- Service: paperless (Testing)
- Status: Inactive (commented out in compose)
1.6 glm-ocr:latest (2.2 GB)#
1.7 gemma4:e4b (9.6 GB)#
1.8 nomic-embed-text:latest (274 MB)#
- Service: Internal Embeddings / RAG
- Status: Active (Implicit)
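To check a host against this inventory, a minimal sketch along these lines may help. It assumes the first column of `ollama ls` is the model name under a header row; `EXPECTED_MODELS` and `missing_models` are illustrative names, and the list is only a subset of the inventory above.

```shell
# Sketch: report which inventory models are missing from an Ollama host.
# EXPECTED_MODELS is a hand-picked subset of the inventory; adjust as needed.
EXPECTED_MODELS="llama3.2:latest qwen3-vl:8b gemma3:12b nomic-embed-text:latest"

# missing_models <ollama-ls-output>: print expected models absent from the listing.
# Assumption: column 1 of `ollama ls` is NAME and the first row is a header.
missing_models() {
  installed=$(printf '%s\n' "$1" | awk 'NR > 1 { print $1 }')
  for model in $EXPECTED_MODELS; do
    printf '%s\n' "$installed" | grep -qxF "$model" || printf '%s\n' "$model"
  done
}

# Only query the live host when the ollama CLI is installed.
if command -v ollama >/dev/null 2>&1; then
  missing_models "$(ollama ls 2>/dev/null)"
fi
```

Anything the function prints can then be fetched with `ollama pull`.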
2 Ollama Installation & Network Setup#
2.1 Basic Installation#
# Open firewall port
ufw allow 11434/tcp
# Official install script
curl -fsSL https://ollama.com/install.sh | sh
2.2 Enable Local Network Access#
To allow other devices to use the AI, you must change the bind address from 127.0.0.1 to 0.0.0.0.
# Edit service configuration
systemctl edit ollama.service
Add the following block:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_MODELS=/mnt/ollama/"
# Reload and restart
systemctl daemon-reload && systemctl restart ollama
3 Update#
# Re-run the official install script to upgrade Ollama in place
curl -fsSL https://ollama.com/install.sh | sh
# Reload systemd and restart the service
sudo systemctl daemon-reload
sudo systemctl restart ollama
4 Multi-GPU Configuration (Tesla T4 & P4)#
If running two different GPUs, you can run two separate Ollama instances on different ports.
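Before pinning instances, it helps to confirm which index each card has. The sketch below assumes `nvidia-smi` is installed and that its CSV output uses `, ` as the separator; `gpu_index` is an illustrative helper, not part of any tool.

```shell
# gpu_index <name-fragment> <csv>: print the index of the first GPU whose
# name contains the fragment. Expects lines like "0, Tesla T4, 15360 MiB".
gpu_index() {
  printf '%s\n' "$2" | awk -F', ' -v want="$1" 'index($2, want) { print $1; exit }'
}

# Query the real hardware only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpus=$(nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader)
  printf '%s\n' "$gpus"
  echo "T4 index: $(gpu_index 'T4' "$gpus")"
  echo "P4 index: $(gpu_index 'P4' "$gpus")"
fi
```

Note that `nvidia-smi` numbers cards by PCI bus order, while CUDA's default ordering can differ. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` alongside `CUDA_VISIBLE_DEVICES` is a common way to make the two agree.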
4.1 Instance 1: Tesla T4 (16GB) (Port 11434)#
Set CUDA_VISIBLE_DEVICES=0 in the main service.
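One way to do this without editing the packaged unit is a systemd drop-in. The sketch below only prints the fragment; the file name `gpu.conf` is an assumption, chosen so it does not overwrite the override created in 2.2.

```shell
# t4_dropin: print a systemd drop-in that pins the main instance to GPU 0.
t4_dropin() {
  cat <<'EOF'
[Service]
Environment="CUDA_VISIBLE_DEVICES=0"
EOF
}

# Preview the fragment. Nothing is written to disk by this script.
t4_dropin

# To apply it (hypothetical file name gpu.conf), run as root:
#   mkdir -p /etc/systemd/system/ollama.service.d
#   t4_dropin > /etc/systemd/system/ollama.service.d/gpu.conf
#   systemctl daemon-reload && systemctl restart ollama
```

Alternatively, the same `Environment=` line can be added to the block opened by `systemctl edit ollama.service`.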
4.2 Instance 2: Tesla P4 (8GB) (Port 11435)#
Create a second service:
sudo nano /etc/systemd/system/ollama-p4.service
[Unit]
Description=Ollama Service (Tesla P4)
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:11435"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_MODELS=/mnt/ollama/"
Environment="CUDA_VISIBLE_DEVICES=1"
[Install]
WantedBy=default.target
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-p4
4.3 Verify status#
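A quick HTTP probe can confirm that both instances answer on their ports. This sketch assumes `curl` is installed and that it runs on the Ollama host itself; `instance_url` and `probe` are illustrative helpers.

```shell
# instance_url <host> <port>: build the base URL of an Ollama instance.
instance_url() { printf 'http://%s:%s' "$1" "$2"; }

# probe <host> <port>: succeed if /api/version answers within 3 seconds.
probe() {
  curl -fsS --max-time 3 "$(instance_url "$1" "$2")/api/version" >/dev/null 2>&1
}

# Check the T4 instance (11434) and the P4 instance (11435).
for port in 11434 11435; do
  if probe 127.0.0.1 "$port"; then
    echo "port $port: up"
  else
    echo "port $port: no response"
  fi
done
```

If a port does not respond, the service logs usually explain why: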
journalctl -u ollama.service -n 50 --no-pager
journalctl -u ollama-p4.service -n 50 --no-pager
5 LLM Management (Ollama CLI)#
5.1 Basic Commands#
# Pull and Run models
ollama pull llama3.2
ollama run llama3.2
# List and remove
ollama ls
ollama rm gemma:7b
# Verify GPU usage during a chat
ollama ps
5.2 Models Library#
- Lightweight: llama3.2, gemma3:4b, ministral-3:3b
- Performance: gemma3:12b, qwen3:8b, mistral:7b
- Coding: qwen2.5-coder:7b
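To fetch several of these in one go, a small dry-run helper can print the pull commands first. The selection in `MODELS` is only an example, and `pull_cmds` is an illustrative name.

```shell
# Example selection: one model from each category above.
MODELS="llama3.2 gemma3:12b qwen2.5-coder:7b"

# pull_cmds: print one `ollama pull` command per model (dry run only).
pull_cmds() {
  for model in $MODELS; do
    printf 'ollama pull %s\n' "$model"
  done
}

# Show what would be downloaded.
pull_cmds

# To actually download, pipe the commands to a shell:
#   pull_cmds | sh
```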
6 Custom Model Storage (External Disk)#
6.1 Format and Mount Partition#
# Format disk with 'ollama' label
mkfs.ext4 /dev/sdb1 -L "ollama"
mkdir -p /mnt/ollama
mount /dev/sdb1 /mnt/ollama
chown -R ollama:ollama /mnt/ollama/
6.2 Persistence (/etc/fstab)#
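The UUID in the entry below belongs to one specific disk. A sketch for building the equivalent line for your own partition follows; it assumes the partition is `/dev/sdb1` as in 6.1, and `fstab_line` is an illustrative helper.

```shell
# fstab_line <uuid> <mountpoint>: print an fstab entry with this guide's options.
fstab_line() {
  printf 'UUID=%s %s ext4 defaults,nofail 0 2\n' "$1" "$2"
}

DEVICE=/dev/sdb1   # assumption: the partition formatted in 6.1

# Look up the UUID only if blkid exists and the device is really present.
if command -v blkid >/dev/null 2>&1 && [ -b "$DEVICE" ]; then
  uuid=$(blkid -s UUID -o value "$DEVICE")
  if [ -n "$uuid" ]; then
    fstab_line "$uuid" /mnt/ollama
  fi
fi
```

After adding the printed line to `/etc/fstab`, running `mount -a` is a quick way to confirm the entry parses before the next reboot.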
UUID=7fee698e-0940-4b26-8faf-3bf764f8a643 /mnt/ollama ext4 defaults,nofail 0 2
6.3 Optimization#
# Set reserved blocks to 0%
tune2fs -m 0 /dev/sdb1
7 Stable Diffusion WebUI Installation#
Requires NVIDIA drivers 550+ and ~10GB disk space.
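Both requirements can be checked up front. This is a rough sketch, assuming `nvidia-smi` is on the PATH and that the install will live under `$HOME`; `driver_ok` is an illustrative helper that compares only the major version.

```shell
# driver_ok <version> <min-major>: succeed if the major version is >= min-major.
driver_ok() {
  major=${1%%.*}
  [ "$major" -ge "$2" ] 2>/dev/null
}

# Report the installed driver version when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  if driver_ok "$ver" 550; then
    echo "driver $ver: ok"
  else
    echo "driver $ver: older than 550"
  fi
fi

# Free space on the filesystem holding $HOME, in whole GiB.
df -Pk "$HOME" | awk 'NR == 2 { print int($4 / 1048576) " GiB free" }'
```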
7.1 Install Dependencies & Python 3.10#
apt install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev git google-perftools
# Set up pyenv for a specific Python version
curl https://pyenv.run | bash
# (Update .bashrc with export paths provided by script)
pyenv install 3.10
pyenv global 3.10
7.2 Download & Start#
mkdir ~/stablediffusion && cd ~/stablediffusion
wget -q https://raw.githubusercontent.com/AUTOMATIC1111/stable-diffusion-webui/master/webui.sh
chmod +x webui.sh
./webui.sh
7.3 Systemd Service Setup#
nano /usr/lib/systemd/system/stablediffusion.service
[Unit]
Description=Stable Diffusion Webui Service
After=network-online.target
[Service]
ExecStart=/home/marc/stablediffusion/stable-diffusion-webui/webui.sh --listen --api
User=marc
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
systemctl daemon-reload
systemctl enable --now stablediffusion
7.4 Add Checkpoint (Model)#
cd ~/stablediffusion/stable-diffusion-webui/models/Stable-diffusion/
wget https://huggingface.co/stabilityai/stable-diffusion-2-inpainting/resolve/main/512-inpainting-ema.ckpt
7.5 Enable Obsidian in Ollama (Mac Studio)#
launchctl setenv OLLAMA_ORIGINS "app://obsidian.md*"
Restart the application (close it from the menu bar).
8 Setting Up a Short-Context Qwen Model in Ollama#
To run the qwen3.6:35b model with a 32k context window, which keeps its memory footprint small enough to leave plenty of RAM for Xcode and your iOS simulators, follow these quick steps in your terminal.
8.1 Create the Configuration File#
Run the following command to automatically generate a Modelfile in your current directory with the context parameter locked to 32k (32,768 tokens):
cat << 'EOF' > Modelfile
FROM qwen3.6:35b
PARAMETER num_ctx 32768
EOF
8.2 Build the New Model Variant#
Next, tell Ollama to create a new, distinct model from that configuration file. We will name this model variant qwen3.6-35b-32k:
ollama create qwen3.6-35b-32k -f ./Modelfile
8.3 Verify and Run#
You can now immediately start up and test your new short-context model directly from the command line:
ollama run qwen3.6-35b-32k
💡 Tip: Now you can point your development tools (like Cursor or Pi) straight at qwen3.6-35b-32k. Ollama will run this variant with its context, and therefore its memory footprint, capped, without you ever needing to touch or restart the global Ollama Desktop UI.
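The same recipe works for any base model and context size. A minimal sketch follows, where `modelfile_for` and `variant_name` are illustrative helpers that reproduce the naming scheme used above.

```shell
# modelfile_for <base-model> <num_ctx>: print a minimal Modelfile.
modelfile_for() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
}

# variant_name <base-model> <label>: qwen3.6:35b + 32k -> qwen3.6-35b-32k
variant_name() {
  printf '%s-%s\n' "$(printf '%s' "$1" | tr ':' '-')" "$2"
}

BASE=qwen3.6:35b

# Preview the Modelfile and the variant name. Nothing is created yet.
modelfile_for "$BASE" 32768
echo "# would create: $(variant_name "$BASE" 32k)"

# To build it for real:
#   modelfile_for "$BASE" 32768 > Modelfile
#   ollama create "$(variant_name "$BASE" 32k)" -f ./Modelfile
```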
9 Setting Up a Short-Context Gemma 4 31B Model in Ollama#
To run the dense gemma4:31b model with a restricted 32k context window—preventing it from hitting your Mac Studio’s VRAM threshold and keeping your execution fast—follow these terminal steps.
9.1 Create the Configuration File#
Run the following command to automatically generate a Modelfile in your current directory with the context parameter locked to 32k (32,768 tokens):
cat << 'EOF' > Modelfile
FROM gemma4:31b
PARAMETER num_ctx 32768
EOF
9.2 Build the New Model Variant#
Next, tell Ollama to create a new, distinct model from that configuration file. We will name this model variant gemma4-31b-32k:
ollama create gemma4-31b-32k -f ./Modelfile
9.3 Verify and Run#
You can now immediately start up and test your new short-context dense model directly from the command line:
ollama run gemma4-31b-32k
💡 Tip: By routing your coding workspace to gemma4-31b-32k, you cap the model's KV cache at 32k tokens, which helps keep it from spilling into your Mac's swap memory. This keeps the dense 31B model fast for focused coding tasks.
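To confirm the context limit was stored with the variant, `ollama show --parameters` can be inspected. The sketch below assumes that command prints one `name value` pair per line; `ctx_from_params` is an illustrative helper.

```shell
# ctx_from_params <text>: extract the num_ctx value from parameter output.
# Assumption: each line has the form "<name> <value>".
ctx_from_params() {
  printf '%s\n' "$1" | awk '$1 == "num_ctx" { print $2; exit }'
}

MODEL=gemma4-31b-32k

# Query the model only when the ollama CLI is installed.
if command -v ollama >/dev/null 2>&1; then
  ctx=$(ctx_from_params "$(ollama show --parameters "$MODEL" 2>/dev/null)")
  echo "num_ctx for $MODEL: ${ctx:-not set}"
fi
```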
10 Codex#
10.1 Use Ollama with Codex and Pi#
ollama launch codex-app --model qwen3.6:27b
pi --provider ollama-local --model qwen3.6:27b