# gpu-tools — vast.ai cold-start helpers for OpenCode on Host B

**Created:** 2026-05-26
**Topology mirrors DL-0012** (Host A reverse SSH tunnel): no cookies, no proxies, no
cloudflare-tunnel layer. The GPU box dials Host B and forwards its Ollama port to a
loopback port on Host B; OpenCode talks to that loopback port. Auth = SSH keys, transport = SSH.

## Files

| Path | Purpose |
|---|---|
| `/opt/gpu-tools/gpu-up` | provision a vast.ai instance + bind to OpenCode |
| `/opt/gpu-tools/gpu-down` | destroy active instance + roll OpenCode back to Host A |
| `/opt/gpu-tools/gpu-status` | show current state (instance, tunnel, opencode endpoint) |
| `/opt/gpu-tools/README.md` | this file |
| `/usr/local/bin/gpu-{up,down,status}` | symlinks to above |
| `/root/.ssh/vast_provisioning_ed25519` | private key — vast.ai instance uses it to dial Host B |
| `/root/.ssh/vast_provisioning_ed25519.pub` | public key — already in `/home/vasttun/.ssh/authorized_keys` |
| `/var/lib/specker/vast-current.json` | state of the active instance (if any) |
| `/home/opencode/.config/opencode/opencode.json.host-a-backup` | created on first `gpu-up`; restored on `gpu-down` |

## Host B side (already done, do not redo)

- system user `vasttun` (nologin, uid 994)
- `/home/vasttun/.ssh/authorized_keys`:
  `restrict,port-forwarding,permitlisten="127.0.0.1:11440" <provisioning pubkey>`
- `/etc/ssh/sshd_config.d/110-vasttun.conf`:
  ```
  Match User vasttun
      AllowTcpForwarding remote
      X11Forwarding no
      AllowAgentForwarding no
      PermitTTY no
  ```
- `vastai` CLI 1.0.13 (pipx, available at `/root/.local/bin/vastai`)
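
To confirm the key restriction is still in place without changing anything, a check along these lines could be run on Host B. It is a minimal sketch: `check_vasttun_key_line` is a hypothetical helper name (not part of gpu-tools), and it only does a substring check for the three options listed above.

```shell
# Hypothetical helper: succeeds only if an authorized_keys line carries the
# three options this README relies on for the given tunnel port.
# Substring matching only; it does not parse the key itself.
check_vasttun_key_line() {
    line=$1
    port=${2:-11440}
    case $line in
        *restrict*) ;;
        *) return 1 ;;
    esac
    case $line in
        *port-forwarding*) ;;
        *) return 1 ;;
    esac
    case $line in
        *"permitlisten=\"127.0.0.1:${port}\""*) ;;
        *) return 1 ;;
    esac
    return 0
}

# Usage on Host B (as root; reads the file, changes nothing):
#   while IFS= read -r l; do
#       check_vasttun_key_line "$l" && echo "ok" || echo "missing options: $l"
#   done < /home/vasttun/.ssh/authorized_keys
```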

## One-time setup before first use

```
vastai set api-key <YOUR_VAST_API_KEY>
```

Get the key from https://cloud.vast.ai/account/ → "API keys" → "Generate".
It is persisted to `/root/.config/vastai/vast_api_key` — survives reboots.

## Usage

### Cold-start a GPU and wire it to OpenCode

```
gpu-up                                  # defaults: qwen3.6:35b-a3b on RTX_4090, <$0.50/h, 80GB disk
gpu-up --model qwen3-coder:30b          # different model
gpu-up --gpu RTX_5090 --max-price 0.80  # bigger card, higher budget
gpu-up --disk 120                       # bigger disk if model is huge
gpu-up --dry-run                        # show what would happen, do not create
```

`gpu-up` will:
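1. Pick the cheapest verified offer that matches the GPU, price, and disk filters.
2. Launch a `pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel` container (not `ollama/ollama:latest`; see the image lesson below) with an onstart script that:
   - installs `autossh` and Ollama (via the official shell installer),
   - starts `ollama serve`,
   - runs `ollama pull <model>`,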
   - opens a reverse SSH tunnel `Host B :11440 ← gpu-box :11434` (port-forwarding-only, key-restricted).
3. Wait up to 15 min for the tunnel to appear on Host B (a cold start typically takes 3-8 min, most of it the model pull).
4. Rewrite `/home/opencode/.config/opencode/opencode.json` to point at `http://127.0.0.1:11440/v1`.
5. Restart `opencode-server.service`.
6. Save state to `/var/lib/specker/vast-current.json`.
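
The reverse tunnel in step 2 can be sketched as a single `autossh` invocation on the GPU box. The helper below only prints the command so it can be inspected. Assumptions, none of them confirmed by the actual `gpu-up` script: the function name `tunnel_cmd`, the key path `/root/.ssh/tunnel_key` inside the instance, and the keep-alive options. The IP is a documentation placeholder.

```shell
# Hypothetical helper: prints the autossh command an onstart script could run
# on the GPU box. The real gpu-up script may build this differently.
tunnel_cmd() {
    host=$1                           # Host B public IP (HOST_B_PUBLIC_IP)
    ssh_port=${2:-2222}               # Host B sshd port (HOST_B_SSH_PORT)
    tunnel_port=${3:-11440}           # loopback port on Host B (TUNNEL_PORT)
    key=${4:-/root/.ssh/tunnel_key}   # assumed key path inside the instance
    printf '%s' "autossh -M 0 -N"
    # -M 0 disables autossh's monitor port; ssh keep-alives detect a dead link.
    # ExitOnForwardFailure makes ssh exit if the remote bind fails, so autossh
    # retries instead of holding a useless connection.
    printf ' %s' \
        "-o ServerAliveInterval=30" \
        "-o ServerAliveCountMax=3" \
        "-o ExitOnForwardFailure=yes" \
        "-o StrictHostKeyChecking=accept-new" \
        "-i $key" "-p $ssh_port" \
        "-R 127.0.0.1:${tunnel_port}:127.0.0.1:11434" \
        "vasttun@${host}"
    printf '\n'
}

tunnel_cmd 203.0.113.10
```

The bind address in `-R` has to match the `permitlisten` entry exactly, otherwise Host B refuses the forward.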

### Shut down

```
gpu-down
```

Destroys the vast.ai instance (billing stops), restores OpenCode's Host A config
from backup, restarts opencode-server. Idempotent.
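
The backup and rollback behaviour can be pictured roughly like this. It is a sketch only: `backup_once` and `restore_backup` are hypothetical names, and whether the real scripts keep or delete the backup after a restore is not stated here (this version keeps it).

```shell
# Hypothetical helpers mirroring the backup/restore behaviour described above.
backup_once() {
    cfg=$1
    bak=$cfg.host-a-backup
    # Only the first gpu-up creates the backup, so a later run cannot
    # overwrite the saved Host A config with a vast.ai one.
    [ -e "$bak" ] || cp -p "$cfg" "$bak"
}

restore_backup() {
    cfg=$1
    bak=$cfg.host-a-backup
    # No backup means gpu-up never ran: nothing to roll back, still success.
    # This is what makes repeated gpu-down calls harmless.
    [ -e "$bak" ] || return 0
    cp -p "$bak" "$cfg"
}
```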

### Inspect

```
gpu-status
```

Shows: the state file, whether the tunnel listener is up, the current OpenCode model + baseURL,
and the raw `vastai show instance` output.
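
If only the tunnel check is needed, it can be approximated by hand with `ss`. The helper below is a sketch with a hypothetical name (`has_loopback_listener`); it assumes the usual `ss -ltn` column layout, where the fourth column is the local address and port.

```shell
# Hypothetical helper: reads `ss -ltn` style output on stdin and succeeds if
# something listens on 127.0.0.1:<port>.
has_loopback_listener() {
    port=${1:-11440}
    awk -v want="127.0.0.1:${port}" '
        $1 == "LISTEN" && $4 == want { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on Host B:
#   ss -ltn | has_loopback_listener 11440 && echo "tunnel up" || echo "tunnel down"
```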

## Tuning defaults

Every flag also has an env var, so you can set per-shell defaults or put them in a systemd `Environment=` line:

```
GPU_FILTER=RTX_4090   # any gpu_name vast.ai uses (e.g. RTX_5090, A100_SXM4, H100_NVL)
MAX_PRICE=0.50        # USD/hour, strict ceiling
DISK_GB=80            # provisioned disk
MODEL=qwen3.6:35b-a3b
TUNNEL_PORT=11440     # changing this also requires updating permitlisten in vasttun's authorized_keys
HOST_B_PUBLIC_IP=YOUR_HOST_B_PUBLIC_IP
HOST_B_SSH_PORT=2222
```
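
The precedence this implies (flag beats env var, env var beats built-in default) can be sketched as follows. `resolve_settings` is a hypothetical name and only two of the flags are shown; the real scripts may parse arguments differently.

```shell
# Sketch of "flag beats env var beats built-in default".
# resolve_settings is a hypothetical name; only two flags are handled.
resolve_settings() {
    gpu=${GPU_FILTER:-RTX_4090}      # env var, else built-in default
    price=${MAX_PRICE:-0.50}
    while [ $# -gt 0 ]; do
        case $1 in
            --gpu)       gpu=$2;   shift 2 ;;
            --max-price) price=$2; shift 2 ;;
            *) echo "unknown flag: $1" >&2; return 2 ;;
        esac
    done
    echo "gpu=$gpu max_price=$price"
}

resolve_settings                                   # built-in defaults, if the env is unset
MAX_PRICE=0.80 resolve_settings                    # env var overrides the default
MAX_PRICE=0.80 resolve_settings --max-price 0.60   # flag overrides the env var
```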

## Failure modes (and how to recover)

| Symptom | Cause | Fix |
|---|---|---|
| `vastai not authenticated` | API key not set | `vastai set api-key <KEY>` |
| `no offers matching` | filter too tight (price/disk/gpu) | raise `--max-price`, change `--gpu`, reduce `--disk` |
| `tunnel did not come up in 15 min` | onstart failed inside instance | `vastai logs <id> -t 200` — common: image lacks autossh because apt was offline; pull failed (small disk); SSH dial blocked (vast.ai egress restrictions on some hosts) |
| `gpu-up` exits but model not in /api/tags | pull still running | wait 1-2 min; or `curl http://127.0.0.1:11440/api/tags` to poll |
| OpenCode says `model is required` | as above (pull not finished), or OpenCode hit a cold model and timed out | run `gpu-status`, then retry |
| Want to switch model on the same instance | not a failure, just a workflow | `ssh` into the vast.ai box (via `vastai ssh <id>`), `ollama pull <new>`, then edit opencode.json manually; or just `gpu-down && gpu-up --model <new>` |
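
For the "pull still running" rows, polling by hand gets tedious. A small wait loop along these lines could help. It is a sketch: both function names are hypothetical, and the match is a plain substring test that assumes Ollama returns compact JSON from `/api/tags` (use `jq` for anything stricter).

```shell
# Hypothetical helper: reads an /api/tags JSON body on stdin and succeeds if
# the model name appears. Plain substring match, no JSON parsing.
model_listed() {
    grep -F -q "\"name\":\"$1\""
}

# Poll the tunnel endpoint until the model shows up or the tries run out.
wait_for_model() {
    model=$1
    tries=${2:-24}               # 24 tries x 5 s sleep = about 2 min
    while [ "$tries" -gt 0 ]; do
        if curl -fsS --max-time 5 http://127.0.0.1:11440/api/tags 2>/dev/null \
            | model_listed "$model"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 5
    done
    return 1
}

# Usage on Host B:
#   wait_for_model qwen3.6:35b-a3b && echo "model ready" || echo "still pulling"
```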

## Security notes

- The provisioning private key lives in `/root/.ssh/` and is **also embedded into the
  onstart-cmd** of every instance (vast.ai stores it server-side in instance metadata
  for the lifetime of the instance). If it leaks, the worst an attacker can do is
  open a remote forward on `127.0.0.1:11440` on Host B, the same loopback port our own
  setup uses, and only while no legitimate tunnel already holds that port. Whatever they
  forward there is what OpenCode would talk to. They cannot get a shell on Host B
  (`restrict` + `nologin`).
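- If you ever need stronger isolation, rotate the key with
  `ssh-keygen -t ed25519 -f /root/.ssh/vast_provisioning_ed25519 -N "" -C "vast@host-b" -q`
  (answer `y` at the overwrite prompt) and replace the old pubkey in
  `/home/vasttun/.ssh/authorized_keys` with the new one, keeping the
  `restrict,port-forwarding,permitlisten=...` prefix. All future `gpu-up` calls will use
  the new key automatically; a running instance keeps the old one until destroyed, so
  its tunnel cannot reconnect once the old pubkey is gone from `authorized_keys`.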
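
When rotating, the easy mistake is to paste the bare pubkey and lose the restrictions. A sketch of generating the full line, with a hypothetical helper name (`restricted_key_line`):

```shell
# Hypothetical helper: prints the authorized_keys line for a public key file,
# with the same restrictions this README lists for vasttun.
restricted_key_line() {
    pubfile=$1
    port=${2:-11440}
    printf 'restrict,port-forwarding,permitlisten="127.0.0.1:%s" %s\n' \
        "$port" "$(cat "$pubfile")"
}

# Rotation on Host B, as root (shown as comments, not executed here).
# Removing the old files first is one way to skip ssh-keygen's overwrite prompt.
#   rm -f /root/.ssh/vast_provisioning_ed25519 /root/.ssh/vast_provisioning_ed25519.pub
#   ssh-keygen -t ed25519 -f /root/.ssh/vast_provisioning_ed25519 -N "" -C "vast@host-b" -q
#   restricted_key_line /root/.ssh/vast_provisioning_ed25519.pub \
#       > /home/vasttun/.ssh/authorized_keys
# Use >> instead of > if a running instance must still reconnect with the old key.
```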

## Origin

This setup was built 2026-05-26 after the manual cloudflare-quick-tunnel + cookie-auth
path proved too fragile (cookies expire, disks fill, web-UI hides Ollama behind an
auth layer that OpenAI-compatible SDKs cannot speak). Reverse SSH tunnel replicates
the proven Host A pattern (DL-0012) for ephemeral GPU instances.

## 2026-05-26 lesson — why image matters

First attempt used `--image ollama/ollama:latest`. It failed with:

```
/.launch: line 48: ssh: command not found
```

repeating forever. Reason: vast.ai runs its own `/.launch` overlay before the
container runs your `onstart-cmd`. That overlay needs `ssh` (client) for portal
connectivity. Minimal images like `ollama/ollama:*` do NOT have it → vast hangs
in a loop, onstart never executes, tunnel never comes up, you pay for an idle
instance.

Default image is now **`pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel`** (ssh-client
present, CUDA libs included). Ollama is installed in onstart via the official
shell installer. Cold-start budget grew by about 1 min (the image is bigger to pull),
but it actually works.

If you want a smaller image, options that include ssh-client:
- `nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04` (devel, not runtime!)
- `vastai/base:cuda-12.4.x-*` (vast.ai managed)
- any `*-devel` ubuntu-based ML image

`runtime` variants are usually too minimal — check before switching.
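
One way to do that check locally before paying for an instance, assuming a machine with Docker and enough disk to pull the image. `image_has_ssh` is a hypothetical helper name; it overrides the entrypoint because images like `ollama/ollama` do not start a shell by default.

```shell
# Hypothetical helper: succeeds if the image ships an ssh client binary.
# Needs a local Docker daemon and pulls the image if it is not cached.
image_has_ssh() {
    docker run --rm --entrypoint sh "$1" -c 'command -v ssh >/dev/null'
}

# Usage (not executed here; devel CUDA images are several GB):
#   image_has_ssh ollama/ollama:latest || echo "no ssh client: /.launch would loop"
```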

## 2026-05-26 addendum — pause / resume

For short breaks (lunch, overnight) you do **not** want `gpu-down` — that destroys
the storage, and the next cold-start has to pull the 22GB model again (5-8 min).

`gpu-pause` calls `vastai stop instance` instead:
- container stops → **compute billing stops** (~$0.78/h → ~$0.001-0.005/h storage-only)
- storage is preserved (ollama binary, model blobs, ssh keys all stay)
- public SSH addr may change after resume — that is fine, our tunnel dials *out*
- OpenCode is rolled back to Host A so it stops hitting a dead :11440

`gpu-resume` calls `vastai start instance`:
- container starts again → onstart re-runs (apt/ollama already there → fast; model
  already pulled → `ollama pull` instantly sees the manifest)
- typical resume = ~30-60s instead of 5-8 min cold-start
- tunnel comes back up → opencode flips back to vast.ai

State file (`/var/lib/specker/vast-current.json`) tracks `state: running|paused`.

```
gpu-up        # full cold-start (~3-8 min, $0.78/h after)
gpu-pause     # short break — storage kept, $0.005/h
gpu-resume    # ~30s warm-start, $0.78/h
gpu-down      # forever — destroys instance + storage
```

When to use which:
- **a break of a few minutes** → just leave it running (cheap enough)
- **a few hours or overnight** → `gpu-pause`
- **away for longer than overnight, or done with this instance** → `gpu-down`
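
The trade-off is easy to put numbers on. The sketch below multiplies break length by the example rates quoted above ($0.78/h running, $0.005/h paused). `break_cost` is a hypothetical helper and the rates are illustrations, not live prices.

```shell
# Back-of-the-envelope cost of a break, using the example rates quoted above.
# break_cost is a hypothetical helper; pass other rates as 2nd and 3rd args.
break_cost() {
    hours=$1
    awk -v h="$hours" -v run="${2:-0.78}" -v store="${3:-0.005}" 'BEGIN {
        printf "running=%.3f paused=%.3f (USD)\n", h * run, h * store
    }'
}

break_cost 1     # one-hour break
break_cost 10    # overnight
```

At these rates an overnight pause costs cents while leaving the instance running costs several dollars, which is why the list above switches to `gpu-pause` once the break is measured in hours.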
