Multi-Character · Multi-Engine

The Deterministic AI Video Pipeline

A professional AI video production workflow: multiple characters, multiple engines

Claude Skills · Nano Banana Pro · Higsfield + Seedance 2.0 · ~$10–25 per minute


The next tier up


The basic pipeline covers one character and one engine at ~$2.50 per video. This page documents the architecture you reach for when that breaks down: multiple characters dancing in unison, exact lip-sync, multi-shot continuity, broadcast-grade post.

If a pipeline survives synthetic K-pop, it survives anything.

The reference workflow comes from creator "JOEY" (Joey Edits, ~225k subs), who built a synthetic K-pop music video. K-pop was chosen specifically because it is the hardest case for generative AI: synchronised choreography, a uniform aesthetic, and exact lip-sync across multiple band members.

The shape is the same six-stage architecture as the basic build, with three architectural upgrades layered on top:

Claude "Skills" replace ad-hoc prompts — two specialised system prompts ("Banana Pro Director" + "Cinema Worldbuilder") that emit locked character sheets and cinematography language deterministically.
Vision feedback loop — every generated still is fed back to Claude with vision input, which flags drift and rewrites the next prompt before any motion credit is spent.
Dual motion engines — Seedance 2.0 for physics + lip-sync, Higsfield for cinematic VFX and localised inpainting. Each clip routes to whichever engine handles its hardest constraint.

Reference cost

Genvid's sci-fi production The Seeker used the Higsfield-based pipeline at roughly 1/500th the cost of their previous animated work (Silent Hill scale). That number is the macro story — the page below is the mechanics that get you there.

Pipeline at a glance

Stage	Output	Primary tool	Programmatic path
1. Orchestration	Locked character sheets + cinematography prompts	Claude Skills	Anthropic API + cached system prompts
2. Scene planning	Per-shot duration budget	Claude	JSON shot list, sum to clip credits
3. Static stills	4K reference frames, up to 5 characters	Nano Banana Pro	`gemini-3-pro-image-preview`
3b. Vision check	Drift report → prompt rewrite	Claude Sonnet 4.6	image input → JSON corrections
4. Motion	5–10s clips, physics + lip-sync	Seedance 2.0 / Higsfield	Replicate · Higsfield API
5. Assembly	Multi-timeline cut + motion graphics	CapCut PC	ffmpeg + drawtext (alt)
6. Master	4K H.265 delivery file	Aiarty / VideoProc	ffmpeg + Real-ESRGAN (alt)

STAGE 01

Claude "Skills": Director & Worldbuilder

Claude Sonnet 4.6 · prompt caching · JOEY's "skills"

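The creator describes building two custom Claude "Skills" over two weeks. In API terms a Skill is a long system prompt that pins behaviour and output schema. Mark it with cache_control and later character/scene calls read it from the prompt cache instead of paying full input price, provided they arrive within the cache lifetime (5 minutes by default, refreshed on each hit) and the prompt meets the model's minimum cacheable length.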

Skill A — Banana Pro Director. Owns character physical lock and wardrobe continuity. Emits a character sheet exactly once per actor; every later prompt prepends the sheet's lock_phrase.

# pip install anthropic
from anthropic import Anthropic
import json

client = Anthropic()  # picks up ANTHROPIC_API_KEY

DIRECTOR_SKILL = """You are the Banana Pro Director. For every character
introduced, emit exactly this JSON and nothing else:

{
  "character_id":  "kebab-case-slug",
  "physical": {
    "build": "...", "face_shape": "...", "skin": "...",
    "hair":  {"length": "...", "color": "...", "style": "..."},
    "eyes":  "...", "distinguishing": ["..."]
  },
  "wardrobe": [{
    "scene_tag": "...",
    "outfit":    "fabric-specific description",
    "fabric_notes": "thread count, weave, behaviour under motion",
    "accessories": ["..."]
  }],
  "lock_phrase": "60-word paragraph prepended to every still prompt"
}

Rules: once physical is set it never changes. Wardrobe entries may
be added but never retconned. The lock_phrase is the only string
that ever ships to the image model."""

def lock_character(brief: str) -> dict:
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": DIRECTOR_SKILL,
            "cache_control": {"type": "ephemeral"},  # cache hit on call 2+
        }],
        messages=[{"role": "user", "content": brief}],
    )
    return json.loads(r.content[0].text)
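
Because every later stage trusts the sheet, it may be worth checking the parsed JSON before it is reused. The following is a minimal sketch that assumes the schema in the Skill above; `validate_sheet` and the 40 to 80 word tolerance around the 60-word target are hypothetical choices, not part of the source workflow.

```python
import re

REQUIRED_TOP = ("character_id", "physical", "wardrobe", "lock_phrase")
REQUIRED_PHYSICAL = ("build", "face_shape", "skin", "hair", "eyes", "distinguishing")

def validate_sheet(sheet: dict, min_words: int = 40, max_words: int = 80) -> list[str]:
    """Return a list of problems; an empty list means the sheet is usable."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP if k not in sheet]
    physical = sheet.get("physical", {})
    problems += [f"missing physical.{k}" for k in REQUIRED_PHYSICAL if k not in physical]

    # character_id is specified as a kebab-case slug.
    slug = sheet.get("character_id", "")
    if slug and not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slug):
        problems.append(f"character_id is not kebab-case: {slug!r}")

    # The lock phrase should be roughly the 60 words the Skill asks for.
    if "lock_phrase" in sheet:
        words = len(sheet["lock_phrase"].split())
        if not (min_words <= words <= max_words):
            problems.append(f"lock_phrase has {words} words, expected about 60")
    return problems
```

A non-empty result is a reasonable trigger for re-running `lock_character` rather than patching the sheet by hand, since the sheet is meant to be emitted once and never edited.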

Skill B — Cinema Worldbuilder. Converts shot descriptions into cinematography language. "Nice sunset shot" is rejected. Output uses real-world gear: lens, aperture, lighting setup, sensor, film stock, atmosphere.

WORLDBUILDER_SKILL = """You are the Cinema Worldbuilder. Convert a shot
description into a single dense prompt. You MUST specify:
  • lens     — focal length + character (35mm anamorphic, 85mm portrait)
  • aperture — depth-of-field intent (f/1.4 isolated, f/8 deep)
  • lighting — named technique (Rembrandt, split, butterfly, chiaroscuro)
  • sensor   — film or digital reference (Alexa LF, 16mm Kodak Vision3)
  • atmosphere — practical haze, anamorphic flare, lens breathing
  • grade    — film-stock reference (Kodak 2383, Fuji 3513)

Forbidden words: cinematic, beautiful, nice, stunning, epic.
Treat the diffusion model as a cinematographer who only speaks gear."""

def shot_prompt(shot: str, characters: list[dict]) -> str:
    locks = " ".join(c["lock_phrase"] for c in characters)
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[{"type": "text", "text": WORLDBUILDER_SKILL,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user",
                   "content": f"Characters: {locks}\nShot: {shot}"}],
    )
    return r.content[0].text

Why this matters

This is the move from prompt engineering to systems engineering. You stop hand-crafting paragraphs per shot and instead define rules once. Every later prompt is generated by a deterministic function of (character lock × shot description). Drift becomes a debugging problem, not an art problem.
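
One way to make "drift is a debugging problem" concrete is to lint the Worldbuilder's output before it reaches the image model. The sketch below assumes the forbidden-word list from the Skill above; `forbidden_hits` and `still_prompt` are hypothetical helper names, and the composition order (lock phrases first) follows the "prepends the sheet's lock_phrase" rule from Skill A.

```python
import re

FORBIDDEN = ("cinematic", "beautiful", "nice", "stunning", "epic")

def forbidden_hits(prompt: str) -> list[str]:
    """Return the forbidden words that appear as whole words in the prompt."""
    return [w for w in FORBIDDEN
            if re.search(rf"\b{w}\b", prompt, flags=re.IGNORECASE)]

def still_prompt(lock_phrases: list[str], shot_prompt_text: str) -> str:
    """Compose the final still prompt: character locks first, then shot language."""
    hits = forbidden_hits(shot_prompt_text)
    if hits:
        # Fail loudly so the bad prompt never costs an image call.
        raise ValueError(f"Worldbuilder output used forbidden words: {hits}")
    return " ".join([*lock_phrases, shot_prompt_text.strip()])
```

The composition step is a pure function of its inputs, so a bad still can be traced back to either the lock phrase or the shot language by diffing two strings.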

STAGE 02

Scene-duration planning

Claude · credit budget

Seedance 2.0 charges per clip-second; Higsfield charges per credit. Asking Claude to emit a per-shot duration plan up front prevents the most common money sink: generating 10-second clips that get cut to 2 seconds in the NLE.

def budget(script: str) -> list[dict]:
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="""Given a script, return JSON list:
            [{"shot_id": "s01", "intent": "...", "seconds": float}]
        Sum of seconds must equal the script's target runtime.
        Round each shot up to the nearest 0.5s. No clip < 2s.""",
        messages=[{"role": "user", "content": script}],
    )
    plan = json.loads(r.content[0].text)
    print(f"Total seconds: {sum(s['seconds'] for s in plan)}")
    return plan

The total seconds × per-second engine rate is your hard budget. If it exceeds what you're willing to spend, the right answer is to cut shots, not shorten them — short clips lose more in pacing than long clips lose in cost.
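
The budget rule above can be sketched in a few lines. This is an illustration only: the per-second rate is a placeholder you would replace with your engine's actual price, and the integer `priority` field is a hypothetical addition that the `budget()` prompt does not emit as written.

```python
def plan_cost(plan: list[dict], rate_per_second: float) -> float:
    """Hard budget: total planned seconds times the engine's per-second rate."""
    return sum(shot["seconds"] for shot in plan) * rate_per_second

def cut_to_budget(plan: list[dict], rate_per_second: float, cap: float) -> list[dict]:
    """Drop whole shots, lowest priority first, until the plan fits the cap.

    Assumes each shot carries a hypothetical integer "priority" field
    (1 = must keep). Shots are removed, never shortened.
    """
    kept = sorted(plan, key=lambda s: s.get("priority", 1))
    while kept and plan_cost(kept, rate_per_second) > cap:
        kept.pop()  # last element has the largest priority number
    keep_ids = {s["shot_id"] for s in kept}
    return [s for s in plan if s["shot_id"] in keep_ids]  # restore script order
```

If you use this, the `budget()` system prompt would also need to request a `priority` value per shot.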

STAGE 03

Static stills + vision feedback loop

Nano Banana Pro · Claude vision · up to 5 characters

Nano Banana Pro (Gemini 3 Pro Image) is the asset-locking workhorse. Its "Consistency by Design" architecture accepts up to 14 reference images covering up to 5 distinct people in a single call, which makes it arguably the only consumer-tier image model that holds multi-character identity reliably. Generate the still, then feed it back to Claude with vision input. Claude's correction list rewrites the next prompt before any motion credit is spent.

from google import genai
from google.genai import types
import base64, pathlib

gem = genai.Client()

def render_still(prompt: str, refs: list[bytes], out: str):
    parts = [types.Part.from_bytes(data=r, mime_type="image/png") for r in refs]
    parts.append(prompt + " Aspect ratio 16:9.")  # no "cinematic": the Worldbuilder forbids it
    r = gem.models.generate_content(
        model="gemini-3-pro-image-preview",
        contents=parts,
        config=types.GenerateContentConfig(response_modalities=["IMAGE"]),
    )
    for p in r.candidates[0].content.parts:
        if p.inline_data:
            pathlib.Path(out).write_bytes(p.inline_data.data)

def verify(image_path: str, intent: str) -> dict:
    img = base64.standard_b64encode(pathlib.Path(image_path).read_bytes()).decode()
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png", "data": img}},
            {"type": "text", "text": f"""Director intent: {intent}

                Compare image to intent. Return JSON only:
                {{ "matches_intent": bool, "drift": ["..."], "fix_prompt": "..." }}
                Look for: wrong outfit, wrong hair, wrong number of subjects,
                wrong lens character, wrong lighting direction."""}
        ]}],
    )
    return json.loads(r.content[0].text)

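# Loop until intent matches (prompt, refs, intent come from the earlier stages)
pathlib.Path("out").mkdir(exist_ok=True)
for attempt in range(3):
    render_still(prompt, refs, "out/s01.png")
    check = verify("out/s01.png", intent)
    if check["matches_intent"]:
        break
    prompt = check["fix_prompt"]
else:
    raise RuntimeError(f"s01 still drifting after 3 attempts: {check['drift']}")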
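
The helpers on this page call `json.loads` directly on the model's reply, which fails if the reply arrives wrapped in a Markdown code fence or with a sentence of preamble. A tolerant parser is one way to harden that step. This is a sketch, and `parse_json_reply` is a hypothetical name rather than anything from the source workflow.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_json_reply(text: str):
    """Parse a model reply that should be JSON but may carry extra wrapping."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag)
        # and everything from the closing fence onward.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} or [...] span in the reply.
        starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
        if not starts:
            raise
        start = min(starts)
        end = max(text.rfind("}"), text.rfind("]"))
        return json.loads(text[start:end + 1])
```

Swapping this in for `json.loads(r.content[0].text)` keeps one stray fence from aborting a run mid-loop. A reply with no recoverable JSON still raises.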

Nano Banana Pro vs standard

Nano Banana (Gemini 2.5 Flash Image) is fast and casual, fine for the basic single-character pipeline. Nano Banana Pro (Gemini 3 Pro Image) does spatial reasoning before pixel generation. It renders legible long-form text, holds 5 identities, supports radical lighting changes without warping geometry, and obeys aspect-ratio retargeting (expanding background, locking subject coordinates). If any of those matter, pay for the Pro tier.

STAGE 04

Motion synthesis: two engines, one router

Seedance 2.0 · Higsfield · route by constraint

Different shots have different hardest constraints. The pipeline routes each one to whichever engine handles its constraint best, rather than picking a single engine and hoping.

Shot type	Hardest constraint	Route to
Synchronised choreography	Physics + multi-body coherence	Seedance 2.0
Singing close-up	Lip-sync to audio reference	Seedance 2.0
Camera move + VFX flares	Cinematic motion control	Higsfield
Hallucination patch	Localised inpainting in moving footage	Higsfield Canvas
Long take > 10s	Continuation without seam	Seedance "continue"
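
The routing table above could be expressed as a small function. The sketch below is one possible encoding: the shot fields (`needs_patch`, `lip_sync`, `choreography`, `camera_move`, `vfx`, `seconds`) and the fallback to Seedance are assumptions made for illustration, not fields the source workflow defines.

```python
from enum import Enum

class Engine(str, Enum):
    SEEDANCE = "seedance-2.0"
    SEEDANCE_CONTINUE = "seedance-continue"
    HIGSFIELD = "higsfield"
    HIGSFIELD_CANVAS = "higsfield-canvas"

def route(shot: dict) -> Engine:
    """Pick an engine from the shot's hardest constraint, mirroring the table."""
    if shot.get("needs_patch"):              # localised hallucination fix
        return Engine.HIGSFIELD_CANVAS
    if shot.get("seconds", 0) > 10:          # long take, continue without a seam
        return Engine.SEEDANCE_CONTINUE
    if shot.get("lip_sync") or shot.get("choreography"):
        return Engine.SEEDANCE               # physics and audio-driven sync
    if shot.get("camera_move") or shot.get("vfx"):
        return Engine.HIGSFIELD              # cinematic motion control
    return Engine.SEEDANCE                   # assumed default, not from the source
```

The checks run in priority order, so a shot that needs both lip-sync and a patch goes to Canvas first. That ordering is a design choice you may want to reverse.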

Seedance 2.0 ingests up to 9 images + 3 video clips + 3 audio references in one call. That audio input is the key — for a music video you pass the song stem and the model generates lip-sync directly from the waveform rather than from a separate sync pass.

import replicate

def seedance(still_path: str, motion_prompt: str, audio_path: str | None):
    inputs = {
        "image":    open(still_path, "rb"),
        "prompt":   motion_prompt,
        "duration": 5,
        "fps":      24,
    }
    if audio_path:                          # lip-sync from waveform
        inputs["audio"] = open(audio_path, "rb")
    return replicate.run("bytedance/seedance-2.0", input=inputs)

Higsfield is the cinematic-control layer that aggregates Sora 2, Veo 3, and its own proprietary models behind a single UI. The two features that justify reaching for it: Canvas (mask and regenerate a localised region across moving frames — the only way to fix a single-second hallucination without scrapping the whole clip) and Soul (binds video generation to the Nano Banana Pro reference sheet so faces don't drift). Higsfield is currently GUI-driven for most users; the programmatic path is to render the clean shots through Seedance and route only the broken ones to Higsfield Canvas manually.

Seedance endpoint quirk

If you call Replicate via raw HTTP, the Seedance 2.0 model-slug endpoint rejects a {"version": ...} field with HTTP 422. Use POST /v1/models/bytedance/seedance-2.0/predictions and drop the version field entirely. The Python SDK above handles this for you.
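
For readers who do go the raw-HTTP route, the request shape described above might look like the following sketch. It only builds the request and does not send it. The input keys mirror the SDK example on this page and are assumptions about the model's schema. `build_prediction_request` is a hypothetical helper.

```python
import json
import urllib.request

API = "https://api.replicate.com/v1/models/bytedance/seedance-2.0/predictions"

def build_prediction_request(inputs: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) the model-slug prediction request."""
    body = {"input": inputs}  # deliberately no "version" key
    return urllib.request.Request(
        API,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is a matter of passing the result to `urllib.request.urlopen`. Note that over raw HTTP, file inputs are generally supplied as URLs or data URIs rather than open file handles.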

STAGE 05

Assembly: CapCut, or scripted ffmpeg

CapCut PC · ffmpeg (scripted)

The source workflow uses CapCut PC's two-timeline feature for primary footage + B-roll layering, auto-captions, motion tracking, and "MagnatesMedia documentary"–style animated curve lines and 3D map overlays. That's a taste pass — opening CapCut and editing with the timeline is fine here.

If you want the assembly step to stay scripted (CI-friendly, re-runnable), ffmpeg does the structural cut and lets you keep CapCut for the final 30-second polish. The patterns the basic pipeline uses (concat demuxer + drawtext + amix) still apply, plus two additions for the multi-engine case:

import subprocess

# 1. Mixed-source normalisation: different engines produce different colour
#    spaces and frame rates, so force every clip through the same filter
#    chain BEFORE concat or xfade. 1080p here; the 4K upscale is Stage 06.
for i, clip in enumerate(clips):  # clips: list of clip paths from Stage 04
    subprocess.run([
        "ffmpeg", "-y", "-i", clip,
        "-vf", "scale=1920:1080:flags=lanczos,fps=24,format=yuv420p,setsar=1",
        "-c:v", "libx264", "-crf", "18", "-preset", "medium",
        f"out/norm_{i:02d}.mp4",
    ], check=True)

# 2. Lip-sync proof: silencedetect logs silence boundaries to stderr. Compare
#    them against the source song to spot sync drift; a cross-correlation of
#    the two audio tracks gives a finer measurement.
subprocess.run([
    "ffmpeg", "-i", "out/norm_00.mp4",
    "-af", "silencedetect=noise=-30dB:d=0.3", "-f", "null", "-",
], check=True)
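
The cross-correlation mentioned in the comment above can be sketched in pure Python. This assumes both tracks have already been decoded to mono samples at the same sample rate (for example with ffmpeg's `-ac 1 -ar 8000 -f s16le` output options) and loaded into lists. `best_lag` and `lag_ms` are hypothetical names, and the brute-force search is only practical for short, downsampled excerpts.

```python
def best_lag(reference: list[float], candidate: list[float], max_lag: int) -> int:
    """Return the shift (in samples) that best aligns candidate to reference.

    A positive result means the candidate is late:
    candidate[n + lag] lines up with reference[n].
    """
    def score(lag: int) -> float:
        total = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(candidate):
                total += r * candidate[m]
        return total

    # Try every lag in the window and keep the one with the highest correlation.
    return max(range(-max_lag, max_lag + 1), key=score)

def lag_ms(lag_samples: int, sample_rate: int) -> float:
    """Convert a lag in samples to milliseconds."""
    return 1000.0 * lag_samples / sample_rate
```

For real footage a NumPy or SciPy correlation would be much faster. The version here only shows what "drift" means numerically.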

Three ffmpeg pitfalls — same as the basic page, repeated because they bite every project

(1) xfade renegotiates back to yuv444p — append ,format=yuv420p to the last filter and force -pix_fmt yuv420p on encode.

(2) drawtext can't parse Unicode escapes — write captions to a .txt file and use textfile=.

(3) For Thai glyphs use tahoma.ttf not tahomabd.ttf (Bold has no Thai code points).
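
The three fixes can be combined in one command builder. The sketch below covers video only, and `crossfade_caption_cmd` is a hypothetical helper. The font path, caption position, and one-second fade are illustrative defaults, and paths are assumed to be relative and free of characters that need escaping inside an ffmpeg filtergraph.

```python
from pathlib import Path

def crossfade_caption_cmd(a: str, b: str, caption: str, out: str,
                          offset: float, font: str = "fonts/tahoma.ttf") -> list[str]:
    """Build an ffmpeg command that crossfades two clips and burns in a caption."""
    caption_file = Path(out).with_suffix(".txt")
    caption_file.write_text(caption, encoding="utf-8")  # pitfall 2: use textfile=
    graph = (
        f"[0:v][1:v]xfade=transition=fade:duration=1:offset={offset}[x];"
        # pitfall 3: regular-weight font file for Thai glyphs
        f"[x]drawtext=fontfile={font}:textfile={caption_file.as_posix()}:"
        "fontsize=64:fontcolor=white:x=(w-text_w)/2:y=h-160,"
        "format=yuv420p[v]"                              # pitfall 1: last filter
    )
    return ["ffmpeg", "-y", "-i", a, "-i", b,
            "-filter_complex", graph, "-map", "[v]",
            "-c:v", "libx264", "-crf", "18",
            "-pix_fmt", "yuv420p",                       # pitfall 1: encoder side
            out]
```

Pass the returned list to `subprocess.run(..., check=True)`. Both inputs should already be normalised to the same resolution and frame rate, since xfade expects matching streams.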

STAGE 06

Upscale + master

Aiarty · VideoProc · Real-ESRGAN (free)

Generative video frequently outputs sub-4K with mild compression artefacts. The source workflow uses Aiarty Video Enhancer for AI denoising and upscaling, then VideoProc Converter AI for H.265 encoding. Both are paid GUI tools.

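The free programmatic equivalent is Real-ESRGAN, which is open source. ffmpeg has no built-in Real-ESRGAN filter, so either extract frames, upscale them locally with a Real-ESRGAN build on your GPU, and re-encode, or send the frames to Replicate's nightmareai/real-esrgan model (an image model, billed per run). Quality is comparable for the 1080p→4K case; the paid tools are faster, not better.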

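# Free path: Real-ESRGAN via Replicate, then H.265 via ffmpeg.
# nightmareai/real-esrgan takes a single image, so upscale frame by frame.
import pathlib, subprocess
import replicate

frames, up = pathlib.Path("out/frames"), pathlib.Path("out/frames_4k")
frames.mkdir(parents=True, exist_ok=True)
up.mkdir(parents=True, exist_ok=True)
subprocess.run(["ffmpeg", "-y", "-i", "out/final_1080.mp4",
                str(frames / "%06d.png")], check=True)

for f in sorted(frames.glob("*.png")):  # one API call per frame
    with open(f, "rb") as fh:
        out = replicate.run("nightmareai/real-esrgan",
                            input={"image": fh, "scale": 2})
    (up / f.name).write_bytes(out.read())

subprocess.run([
    "ffmpeg", "-y", "-framerate", "24", "-i", str(up / "%06d.png"),
    "-i", "out/final_1080.mp4", "-map", "0:v", "-map", "1:a?",  # reuse original audio
    "-c:v", "libx265", "-crf", "20", "-preset", "medium", "-pix_fmt", "yuv420p",
    "-tag:v", "hvc1",  # Safari/iOS playback
    "-c:a", "aac", "-b:a", "192k",
    "out/final_4k.mp4",
], check=True)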

The aggregator landscape

If running six separate API keys is more friction than it's worth, the market has responded with workflow aggregators — single subscriptions that route between Gemini, Veo, Seedance, Kling, Sora, HeyGen, ElevenLabs, etc. behind one UI. You give up some control; you gain not juggling browser tabs.

Platform	What's bundled	Distinguisher
Vimerse Studio	Gemini, ElevenLabs, Flux, Imagen, Veo, Kling, Seedance	End-to-end script → voice → video
Tagshop AI	Nano Banana 2/Pro, Seedance, Sora 2, Kling 3, Wan 2.6, Hailuo	Highest-end model breadth
Vadoo AI	Veo, Kling, Hailuo, Sora, Runway	Built-in lip-sync studio + extension
Zeemo	Multi-model image + video	Coherent multi-shot export from one prompt
Baz V4	Veo 3.1, Seedance, HeyGen, Minimax TTS, Nano Banana	Chat-driven agentic timeline editor
Vizard AI	Seedance, Kling, multi-model	Auto-detect highlights, brand kits, B-roll

The strategic read: foundation models are commoditising into API endpoints. The competitive layer is moving up to orchestration. Pick an aggregator when you'd otherwise spend more time on glue than on the work; stay raw-API when the glue is the work.

Physical capture that grounds the synthetic

Even an end-to-end synthetic music video benefits from a small amount of real capture. AI sound generators are still weak at sharp foley impacts; AI faces still composite better against real backplates than fully-generated ones.

Tascam DR-10L — pocket-size lav recorder. The source workflow used it for foley: real impact sounds for "what an alien sounds like when hit." Anything percussive or close-mic'd that AI struggles to invent.
Panasonic Lumix GH5 + 12–35mm — 10-bit 4K micro-four-thirds. Cheap, clean, and the dynamic range survives composite alongside Seedance output without aesthetic clash.

The pattern: the AI generates the impossible (synchronised dance, multiple identities, custom worlds); physical hardware grounds the parts where reality is cheaper than synthesis.

What it actually costs

Item	Cost	Notes
Claude tokens	$1–3	Director + Worldbuilder + vision
Nano Banana Pro	$5–10	~40–80 stills @ ~$0.12 each
Seedance / Higsfield	$8–18	~30–60 clip-seconds
Voice + audio	$1–3	ElevenLabs or Suno
Upscale + master	$0–3	Real-ESRGAN free; Aiarty paid
Total	~$15–35	per ~3-minute music video

The headline 1/500th-cost figure from Genvid's The Seeker only holds against million-dollar animated baselines. For a creator coming from the basic pipeline the comparison runs the other way: this build costs roughly 10× more per video, ~$25 for a 3-minute multi-character build vs. ~$2.50 for a 2-minute single-character build. The extra spend buys character continuity across shots and lip-sync that survives close-ups.
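
The arithmetic behind these figures is easy to check. The sketch below uses only the line-item ranges and the two reference builds quoted on this page; the dictionary keys are illustrative names. It shows that the upper bounds of the line items sum to $37, slightly above the quoted ~$35, and that the gap to the basic pipeline narrows when measured per minute rather than per video.

```python
LINE_ITEMS = {                      # (low, high) in USD, from the cost breakdown
    "claude_tokens":   (1, 3),
    "nano_banana_pro": (5, 10),
    "motion":          (8, 18),
    "voice_audio":     (1, 3),
    "upscale_master":  (0, 3),
}

low = sum(lo for lo, _ in LINE_ITEMS.values())
high = sum(hi for _, hi in LINE_ITEMS.values())
print(low, high)                              # 15 37

per_video_ratio = 25 / 2.50                   # ~$25 build vs ~$2.50 build
pro_per_min = 25 / 3                          # 3-minute multi-character build
basic_per_min = 2.50 / 2                      # 2-minute single-character build
print(round(per_video_ratio, 1))              # 10.0
print(round(pro_per_min / basic_per_min, 1))  # 6.7
```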

The structural shift

From prompt engineer to systems architect

The human role in this pipeline isn't "writer of clever prompts." It's "designer of the system that writes prompts." You write two long Skills once, define the routing rules between engines once, and the day-to-day work becomes specifying intent and reviewing output — not crafting paragraphs by hand.

Mid-tier media restructures, prestige holds

Hollywood-scale productions keep their physical crews and bespoke VFX. Everything below — YouTube documentaries, corporate training, ads, indie film, music videos — drops to a few API subscriptions and a competent operator. The Genvid 1/500th number is the asymptote, not the average; the practical ratio for mid-tier work is closer to 10–50× cheaper than the previous baseline.

Where value migrates

Foundation models commoditise into APIs. Aggregators commoditise into UIs. The durable value sits in two places: (1) the taste required to direct the system, and (2) the engineering required to integrate it into something repeatable. Neither is a model output.