🛠️ DGX Spark Field Notes

arm64 + Blackwell gotchas, from a production self-hosted stack

The DGX Spark is brand-new hardware on an unusual axis: Grace Blackwell GB10, arm64, ~128 GB unified memory, NVIDIA DGX OS. Almost the entire ML tooling stack quietly assumes the opposite — x86_64, a discrete CUDA GPU, and separate VRAM. Where those assumptions break, the errors are cryptic and barely documented yet. These are the ones we actually hit running a live avatar + video-generation stack on the box, each with the symptom, the real cause, and the fix that worked. If you're standing up your own Spark, this is the page we wish had existed.

Companion to the trilingual overview at spark.html. Hardware-specific; no secrets or host details here.

1 · Why these bite

Three properties of the Spark cause nearly every surprise. Keep them in mind and most errors become predictable:

① It's arm64

Mainstream Docker images and a chunk of PyPI wheels are linux/amd64 only. They either won't pull, or pull and fail at import with a missing symbol.

② It's Blackwell (very new)

Compute capability 12.1 (sm_121). Fresh pip wheels may target a CUDA runtime the base image doesn't ship, so a library installs cleanly but can't load.

③ Unified memory

No discrete VRAM. Tools that report or assume a separate GPU memory pool behave oddly — sometimes that's a real bug, sometimes it's normal and you must learn to ignore it.

The single most useful habit: start from an NVIDIA multi-arch base image (nvcr.io/nvidia/pytorch:25.04-py3) and treat its pre-tuned PyTorch as sacred — never let pip overwrite it.

2 · Building containers on arm64

Mainstream ComfyUI / SD images won't run [image]

Cause

Popular images (ai-dock/comfyui, yanwk/comfyui-boot, frankjoshua/comfyui) ship linux/amd64 only. There's no arm64 manifest, so they don't run on the Spark.

Fix

Build your own image FROM nvcr.io/nvidia/pytorch:25.04-py3 — it's multi-arch and pre-tuned for Blackwell. Confirm both arches exist before you commit to it:

docker manifest inspect nvcr.io/nvidia/pytorch:25.04-py3 | grep architecture
# expect both "amd64" and "arm64"
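The same check can be scripted so a build fails early when an image has no arm64 variant. This is a minimal sketch, assuming the JSON shape that docker manifest inspect prints for a multi-arch manifest list (a top-level "manifests" array whose entries carry platform.architecture). The helper names are hypothetical.

```python
import json
import subprocess


def architectures(manifest_json: str) -> set:
    """Return the architectures listed in a manifest-list JSON document."""
    doc = json.loads(manifest_json)
    # A single-arch image has no "manifests" array, so this yields an empty set.
    return {
        entry["platform"]["architecture"]
        for entry in doc.get("manifests", [])
        if "platform" in entry
    }


def image_supports_arm64(image: str) -> bool:
    """Run `docker manifest inspect` and look for an arm64 entry."""
    out = subprocess.run(
        ["docker", "manifest", "inspect", image],
        check=True, capture_output=True, text=True,
    ).stdout
    return "arm64" in architectures(out)
```

Calling image_supports_arm64("nvcr.io/nvidia/pytorch:25.04-py3") needs a working docker CLI and, for nvcr.io, the NGC login described below.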

Pulling from nvcr.io needs a (free) NGC login: docker login nvcr.io with username literally $oauthtoken and your NGC key as the password.

Container crash-loops on libcudart.so.13: cannot open shared object file [build]

Symptom

The image builds fine, then the container dies immediately at startup with an OSError about a missing libcudart.so.13 (or another CUDA .so with a version higher than your base ships).

Cause

A requirements.txt lists torch / torchvision / torchaudio unpinned. pip install -r happily fetches fresh wheels built against a newer CUDA (e.g. CUDA 13), clobbering the base image's Blackwell-tuned PyTorch (CUDA 12.x runtime). The new wheel then can't find the CUDA 13 runtime that isn't there.

Fix

Filter the torch family out of every requirements.txt before installing, so the base PyTorch is preserved and only the other deps get added:

# in the Dockerfile, for ComfyUI and each custom node's requirements
RUN grep -vE '^(torch|torchvision|torchaudio)([=<>!~]|$)' requirements.txt \
      > requirements.filtered.txt \
 && pip install --no-cache-dir -r requirements.filtered.txt
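If the filtering has to cope with looser spellings (a space before the version specifier, mixed case, an environment marker after the name), the same idea can be written in Python. This is a sketch rather than the method used on the box; PROTECTED and filter_requirements are hypothetical names.

```python
import re

# Packages owned by the base image; pip must never replace them.
PROTECTED = {"torch", "torchvision", "torchaudio"}

# Leading project name of a requirement line, e.g. "torch" in "torch>=2.1".
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def filter_requirements(text: str) -> str:
    """Drop lines that would install a protected package; keep everything else."""
    kept = []
    for line in text.splitlines():
        match = _NAME.match(line)
        # Normalize like pip does: case-insensitive, runs of "-", "_", "." are equal.
        name = re.sub(r"[-_.]+", "-", match.group(1)).lower() if match else None
        if name in PROTECTED:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Comments, option lines such as -r other.txt, and look-alike packages such as torchsde pass through untouched, because only an exact name match is dropped.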

ComfyUI startup dies on import torchaudio [build]

Cause

The NVIDIA arm64 PyTorch base ships without torchaudio, but ComfyUI imports it at module load (via the LTX/audio-VAE path). You can't just install it: the CUDA-13 wheel needs a libcudart.so.13 that the base doesn't ship, and a version-pinned wheel hits a torch ABI mismatch (undefined symbol: _ZNK5torch8autograd4Node4nameEv).

Fix

Unless you exercise the audio-VAE path, the bare import torchaudio is the only torchaudio code that ever runs, so a tiny stub package is enough to satisfy it. Add this to the Dockerfile before EXPOSE:

RUN pip uninstall -y torchaudio 2>/dev/null || true \
 && SITE=$(python -c 'import site; print(site.getsitepackages()[0])') \
 && mkdir -p "$SITE/torchaudio" \
 && printf '__version__ = "0.0.0-stub"\n' > "$SITE/torchaudio/__init__.py"

Custom nodes referencing xformers, or shipping prebuilt CUDA .so files [image]

Cause

The NVIDIA base ships its own optimized attention kernels — xformers is unnecessary and its wheels fight your base. Separately, node packs (e.g. the original IP-Adapter) ship CUDA extensions compiled for x86_64; on arm64 they silently fail to load, and the node just doesn't appear.

Fix

Launch ComfyUI with --use-pytorch-cross-attention instead of expecting xformers. For node packs, install the pure-Python forks where they exist; treat any pack that compiles a .so as suspect until proven on arm64.
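One way to vet a node pack before installing it is to read the ELF header of every .so it ships and see which architecture it was built for. The sketch below is an illustration, not part of the original stack. The e_machine values (62 for x86-64, 183 for AArch64) come from the ELF specification; the function names are hypothetical.

```python
import struct
from pathlib import Path
from typing import List, Optional, Tuple

# e_machine values defined by the ELF specification.
ELF_MACHINES = {62: "x86_64", 183: "aarch64"}


def elf_machine(header: bytes) -> Optional[str]:
    """Return the architecture named in an ELF header, or None if not ELF."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown({machine})")


def foreign_binaries(root: str, want: str = "aarch64") -> List[Tuple[str, str]]:
    """List (path, arch) for every .so under root that was not built for `want`."""
    found = []
    for path in sorted(Path(root).rglob("*.so")):
        with open(path, "rb") as fh:
            arch = elf_machine(fh.read(20))
        if arch is not None and arch != want:
            found.append((str(path), arch))
    return found
```

Running foreign_binaries on a custom_nodes checkout gives a quick list of packs to treat as suspect. An empty list only means no mismatched binaries were found; it does not prove the pack works.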

3 · GPU access inside containers

A running container suddenly loses the GPU: CUDA error: operation not permitted [runtime]

Symptom

A container that was working starts failing on a trivial CUDA op (e.g. torch.embedding during text-encode). Inside it, nvidia-smi returns Failed to initialize NVML: Unknown Error — but the host nvidia-smi is perfectly healthy. The triviality of the failing op is the tell; a real OOM looks different.

Cause

Classic NVIDIA-container + systemd-cgroup bug. A host systemctl daemon-reload (fired inadvertently by a package install or service edit) rewrites the device cgroup and revokes the running container's access to /dev/nvidia*. The container keeps running on a stale CUDA context; new kernel launches are denied.

Fix — immediate

docker restart <container>   # re-injects device access
docker exec <container> nvidia-smi --query-gpu=name --format=csv,noheader   # → NVIDIA GB10

Fix — permanent

Switch Docker's cgroup driver from systemd to cgroupfs (cgroup v2 + the systemd driver is the exact bug config). Edit /etc/docker/daemon.json and merge the key in; don't clobber the existing runtimes.nvidia block or you kill GPU access for every container:

// /etc/docker/daemon.json
{
  "runtimes": { "nvidia": { ... keep this exactly ... } },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

sudo systemctl restart docker
# verify the fix survives the trigger:
sudo systemctl daemon-reload
docker exec <container> nvidia-smi --query-gpu=name --format=csv,noheader   # still → NVIDIA GB10
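Hand-editing daemon.json is where the runtimes block tends to get lost, so the merge can be done programmatically instead. This is a minimal sketch, assuming the file already exists and is plain JSON without comments. The function names are hypothetical, and the script must run as root to write the file.

```python
import json


def with_cgroupfs_driver(config: dict) -> dict:
    """Return a copy of a daemon.json dict with the cgroupfs driver selected.

    Every other key, including "runtimes", is carried over unchanged.
    """
    merged = dict(config)
    # Drop any earlier cgroupdriver choice but keep unrelated exec-opts.
    opts = [o for o in merged.get("exec-opts", [])
            if not o.startswith("native.cgroupdriver=")]
    opts.append("native.cgroupdriver=cgroupfs")
    merged["exec-opts"] = opts
    return merged


def rewrite_daemon_json(path: str = "/etc/docker/daemon.json") -> None:
    """Read, merge, and write back the Docker daemon config."""
    with open(path) as fh:
        config = json.load(fh)
    with open(path, "w") as fh:
        json.dump(with_cgroupfs_driver(config), fh, indent=2)
        fh.write("\n")
```

Docker still has to be restarted afterwards for the new driver to take effect.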

nvidia-smi shows [N/A] for memory used/total inside the container [not a bug]

Cause / verdict

This is normal on the GB10. Unified memory means there's no discrete VRAM figure to report, so memory.used / memory.total come back [N/A]. Don't chase it — it is not the cgroup bug above (that one fails NVML entirely, not just the memory fields).
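A health check can encode that distinction so the harmless case never pages anyone. The sketch below classifies nvidia-smi output using only the two signatures described in this section. The state labels and function names are hypothetical.

```python
import subprocess


def classify_gpu_state(returncode: int, output: str) -> str:
    """Tell the NVML failure apart from the harmless [N/A] memory fields."""
    if "Failed to initialize NVML" in output:
        return "lost-gpu"      # the cgroup bug: restart the container
    if returncode != 0:
        return "error"         # something else went wrong; read the output
    if "[N/A]" in output:
        return "ok-unified"    # normal on the GB10: no discrete VRAM figure
    return "ok"


def probe(container: str) -> str:
    """Query nvidia-smi inside a container and classify the result."""
    proc = subprocess.run(
        ["docker", "exec", container, "nvidia-smi",
         "--query-gpu=name,memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return classify_gpu_state(proc.returncode, proc.stdout + proc.stderr)
```

Only "lost-gpu" and "error" call for action; "ok-unified" is the expected steady state on this hardware.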

4 · Shipping code from Windows to the Spark

If your dev box is Windows and the Spark is Linux, the line-ending and permission-bit gaps bite at the worst time — inside a container ENTRYPOINT, where the error is opaque.

/usr/bin/env: 'bash\r': No such file or directory [CRLF]

Symptom

A shell script that's obviously correct refuses to run on the Spark, complaining about bash\r. Worse, a script COPY'd into an image makes the container's ENTRYPOINT die on start.

Cause

Git on Windows with core.autocrlf=true smudges text files to CRLF on export — including git archive | tar. The committed blob can be LF and you'll still ship \r\n. (Verifying with grep $'\r' in MSYS gives false negatives — don't trust it.)

Fix

Pin line endings at the repo level so archive/checkout always emit LF regardless of autocrlf:

# .gitattributes
*.sh       text eol=lf
Dockerfile text eol=lf

Verify with a byte count, not grep:

python -c "print(open('script.sh','rb').read().count(b'\r\n'))"   # must be 0
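As a safety net on the receiving side, the same byte-level check can also repair a file that slipped through. This is a sketch with hypothetical helper names; it complements the .gitattributes pin rather than replacing it.

```python
from pathlib import Path


def crlf_count(data: bytes) -> int:
    """Count Windows line endings in raw bytes."""
    return data.count(b"\r\n")


def normalize_to_lf(path: str) -> int:
    """Rewrite a file with LF endings; return how many CRLFs were replaced."""
    p = Path(path)
    data = p.read_bytes()
    n = crlf_count(data)
    if n:
        p.write_bytes(data.replace(b"\r\n", b"\n"))
    return n
```

Reading and writing in binary mode matters here: text mode would let Python translate line endings and hide the problem being measured.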

Scripts arrive without the executable bit [perms]

Cause

git archive | tar drops the executable bit on extract, so ./run.sh fails with permission denied.

Fix

Either chmod +x *.sh after extract, or commit the mode so it travels with the file: git update-index --chmod=+x script.sh (Git then records the file as mode 100755).
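For a deploy script that already runs Python on the Spark, the post-extract chmod can be folded in as a step that also reports what it changed. This is a sketch; make_scripts_executable is a hypothetical name, and it sets execute for user, group, and other regardless of umask.

```python
import os
import stat
from pathlib import Path
from typing import List


def make_scripts_executable(root: str, pattern: str = "*.sh") -> List[str]:
    """Add the executable bits to matching files under root that lack them."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        mode = path.stat().st_mode
        if not mode & stat.S_IXUSR:
            # Roughly chmod a+x: execute for user, group, and other.
            os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
            changed.append(str(path))
    return changed
```

A non-empty return value is a hint that the mode was never committed, which is worth fixing at the source.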

5 · Public exposure with Tailscale Funnel

Tailscale Funnel gives a free *.ts.net HTTPS hostname with no inbound port on the home router — the cleanest way to share a Spark service. (Cloudflare Tunnel's free tier needs an apex domain; a subdomain zone wants a Business plan. AWS has no native equivalent.) Running it in Docker has three non-obvious requirements.

tailscale container restart-loops (~64 times) and never authenticates [auth]

Cause

The default tailscale/tailscale entrypoint (containerboot) enforces a 60-second auth-completion timer. Interactive Funnel auth (you click a URL) routinely exceeds it, and the container crash-loops.

Fix

Bypass containerboot — run tailscaled directly and do tailscale up as a separate step:

docker run -d --name tailscale --network=host \
  --cap-add=NET_ADMIN --device=/dev/net/tun:/dev/net/tun \
  -v tailscale-state:/var/lib/tailscale --restart unless-stopped \
  --entrypoint /usr/local/bin/tailscaled \
  tailscale/tailscale:latest
docker exec tailscale tailscale up        # prints the auth URL — click once
docker exec tailscale tailscale funnel --bg 7861

A second service needs its own hostname, but the second daemon won't stay up [multi]

Cause

Two tailscaled daemons on --network=host fight over the default TUN device tailscale0; the loser restart-loops with "device or resource busy." They also share the default socket and state path.

Fix

Give the second daemon its own TUN device, socket, and state volume:

docker run -d --name tailscale-2 --hostname my-second-service \
  --network=host --cap-add=NET_ADMIN --device=/dev/net/tun:/dev/net/tun \
  -v tailscale-2-state:/var/lib/tailscale --restart unless-stopped \
  --entrypoint /usr/local/bin/tailscaled \
  tailscale/tailscale:latest \
  --tun=tailscale1 \
  --socket=/var/run/tailscale-2.sock \
  --state=/var/lib/tailscale/tailscaled.state
# every CLI call into THIS daemon must name its socket:
docker exec tailscale-2 tailscale --socket=/var/run/tailscale-2.sock up

Device shows "Connected" in admin, but phones off-tailnet can't reach it [410 ghost]

Symptom

Funnel reports "on", the admin UI shows the device healthy with a valid TLS cert — yet the public *.ts.net hostname never resolves for clients off the tailnet (e.g. a phone on cellular). On-tailnet clients work fine (they use MagicDNS and bypass the funnel relay).

Cause

A transient "register request: http 410: auth path not found" error during the first auth round-trip (visible in docker logs) admits the device just enough to look healthy, but the control plane never finishes pushing the per-device capability bundle, so public DNS is never published.

Diagnose

Compare the capability map against a working sibling device. No CLI command surfaces this gap, so you must read the raw JSON:

docker exec tailscale tailscale status --json | python -c "import sys,json; print(sorted(json.load(sys.stdin)['Self']['CapMap']))"
# broken device is missing https, default-auto-update, most URL-prefixed caps
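Eyeballing two sorted lists gets tedious, so the comparison can be automated. The sketch below assumes you have saved tailscale status --json output from both the broken device and a healthy sibling, and that Self.CapMap is an object keyed by capability name as in the one-liner above. The function names are hypothetical.

```python
import json
from typing import List, Set


def cap_names(status_json: str) -> Set[str]:
    """Capability names from `tailscale status --json` output."""
    cap_map = json.loads(status_json).get("Self", {}).get("CapMap") or {}
    return set(cap_map)


def missing_caps(broken_status: str, healthy_status: str) -> List[str]:
    """Capabilities the healthy device has that the broken one lacks."""
    return sorted(cap_names(healthy_status) - cap_names(broken_status))
```

A long result that includes https is the signature described here; an empty result points the investigation somewhere else.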

Recover

Neither tailscale up --reset nor a restart fixes it — the bad state is server-side. You must re-register: remove the device in the Tailscale admin, wipe local identity (docker rm -f the container and docker volume rm its state volume), recreate, tailscale up (click the new URL), then tailscale funnel --bg. Allow 30–90s for DNS + TLS to warm up at the relays.

6 · Day-to-day operations

SSH drops with exit 255 whenever the GPU is under heavy load [ops]

Cause

Sustained generation (e.g. video diffusion at 720p) saturates the GB10 and the power envelope spikes; the SSH session gets starved and drops (ssh exit 255).

Fix

Don't rely on SSH to drive long GPU jobs. Submit work over the service's HTTP API through the Funnel instead (e.g. ComfyUI's /prompt + /history + /view) — no SSH in the loop. If you must SSH, use a long ConnectTimeout (60s) and run jobs detached (nohup/docker exec -d) so they survive a dropped session.
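A client for that HTTP flow fits in a few lines of standard-library Python. This is a rough sketch, assuming ComfyUI's usual behaviour: POST /prompt answers with a prompt_id, and GET /history/<prompt_id> returns an entry with an outputs map once the job has finished. The workflow must be in ComfyUI's API export format, and the function names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request


def prompt_request(base_url: str, workflow: dict) -> urllib.request.Request:
    """Build the POST /prompt request that queues a workflow."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def view_url(base_url: str, filename: str, subfolder: str = "",
             kind: str = "output") -> str:
    """Build the GET /view URL for one generated file."""
    query = urllib.parse.urlencode(
        {"filename": filename, "subfolder": subfolder, "type": kind})
    return f"{base_url.rstrip('/')}/view?{query}"


def run_workflow(base_url: str, workflow: dict, poll_seconds: float = 5.0) -> dict:
    """Queue a workflow, then poll /history until its outputs appear."""
    with urllib.request.urlopen(prompt_request(base_url, workflow)) as resp:
        prompt_id = json.load(resp)["prompt_id"]
    while True:
        url = f"{base_url.rstrip('/')}/history/{prompt_id}"
        with urllib.request.urlopen(url) as resp:
            history = json.load(resp)
        if prompt_id in history:
            return history[prompt_id]["outputs"]
        time.sleep(poll_seconds)
```

Because each poll is a fresh short request, a dropped connection costs one retry instead of the whole job. A production client would add a timeout and error handling around run_workflow.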

Permission denied writing into a bind-mounted models/ directory [ops]

Cause

The container runs as root, so host-mounted directories it created are owned by root:root. An ssh … wget from your normal user can't write into them.

Fix

Download through the container so the write happens as root, and detach it so it survives an SSH drop:

docker exec -d <container> bash -c \
  "cd /workspace/ComfyUI/models && wget -c <url> > /workspace/ComfyUI/output/dl.log 2>&1"
# log to a bind-mounted dir so you can tail it from the host

Video generated on the Spark plays on desktop but Android refuses it silently [mobile]

Cause

Default ffmpeg output uses a profile/level many mobile decoders reject without an error — the <video> just stays black.

Fix

Constrain the encode for broad mobile decode support:

ffmpeg -i in.mp4 -c:v libx264 -profile:v main -level 4.0 \
  -pix_fmt yuv420p -movflags +faststart out.mp4

(Same family of gotcha as in-app players that won't decode high/4:4:4 streams — keep to main@4.0 + yuv420p.)
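Since the failure is silent, it helps to verify a file before publishing it. The sketch below asks ffprobe for the first video stream and checks it against the main@4.0 + yuv420p target. Accepting the baseline profiles as well is an assumption on our part (they are subsets of main), and the function names are hypothetical.

```python
import json
import subprocess


def is_mobile_safe(stream: dict) -> bool:
    """Check one ffprobe video stream against main@4.0 + yuv420p."""
    return (
        stream.get("codec_name") == "h264"
        # Assumption: baseline profiles are also fine for mobile decoders.
        and stream.get("profile") in {"Main", "Constrained Baseline", "Baseline"}
        and int(stream.get("level", 999)) <= 40   # ffprobe reports 4.0 as 40
        and stream.get("pix_fmt") == "yuv420p"
    )


def probe_video(path: str) -> dict:
    """Return codec, profile, level, and pixel format of the first video stream."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=codec_name,profile,level,pix_fmt",
         "-of", "json", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["streams"][0]
```

Gating the publish step on is_mobile_safe(probe_video("out.mp4")) turns the black-player symptom into an explicit check at build time.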

Found this useful or have a Spark gotcha of your own? This page is part of an open, evolving field guide. The matching reproducible bring-up (Docker definitions for the whole stack) lives alongside it in the project repo. Corrections and additions welcome via the channels below.