The Deterministic AI Video Pipeline
A professional-grade AI video production workflow: multiple characters, multiple engines
The next tier up
The basic pipeline covers one character and a single engine at ~$2.50 per video. This page documents the architecture you reach for when that breaks down: multiple characters dancing in unison, exact lip-sync, multi-shot continuity, broadcast-grade post.
If a pipeline survives synthetic K-pop, it survives anything.
The reference workflow comes from creator "JOEY" (Joey Edits, ~225k subscribers), who built a synthetic K-pop music video. The genre was chosen specifically because it is the hardest case for generative AI: synchronised choreography, a uniform aesthetic, and exact lip-sync across multiple band members.
The shape is the same six-stage architecture as the basic build, with three architectural upgrades layered on top:
- Claude "Skills" replace ad-hoc prompts: two specialised system prompts ("Banana Pro Director" + "Cinema Worldbuilder") that emit locked character sheets and cinematography language deterministically.
- Vision feedback loop: every generated still is fed back to Claude with vision input, which flags drift and rewrites the next prompt before any motion credit is spent.
- Dual motion engines: Seedance 2.0 for physics + lip-sync, Higgsfield for cinematic VFX and localised inpainting. Each clip routes to whichever engine handles its hardest constraint.
Genvid's sci-fi production The Seeker used the Higgsfield-based pipeline at roughly 1/500th the cost of their previous animated work (Silent Hill scale). That number is the macro story; the rest of this page covers the mechanics that get you there.
Pipeline at a glance
| Stage | Output | Primary tool | Programmatic path |
|---|---|---|---|
| 1. Orchestration | Locked character sheets + cinematography prompts | Claude Skills | Anthropic API + cached system prompts |
| 2. Scene planning | Per-shot duration budget | Claude | JSON shot list, sum to clip credits |
| 3. Static stills | 4K reference frames, up to 5 characters | Nano Banana Pro | gemini-3-pro-image-preview |
| 3b. Vision check | Drift report → prompt rewrite | Claude Sonnet 4.6 | image input → JSON corrections |
| 4. Motion | 5–10s clips, physics + lip-sync | Seedance 2.0 / Higgsfield | Replicate · Higgsfield API |
| 5. Assembly | Multi-timeline cut + motion graphics | CapCut PC | ffmpeg + drawtext (alt) |
| 6. Master | 4K H.265 delivery file | Aiarty / VideoProc | ffmpeg + Real-ESRGAN (alt) |
Claude "Skills": Director & Worldbuilder
Claude Sonnet 4.6 · prompt caching · JOEY's "skills"
The creator describes building two custom Claude "Skills" over two weeks. In API terms a Skill is a long system prompt that pins behaviour and output schema. Cache it once and every subsequent character/scene call hits the cache.
Skill A — Banana Pro Director. Owns character physical lock and wardrobe continuity. Emits a character sheet exactly once per actor; every later prompt prepends the sheet's lock_phrase.
```python
# pip install anthropic
from anthropic import Anthropic
import json

client = Anthropic()  # picks up ANTHROPIC_API_KEY

DIRECTOR_SKILL = """You are the Banana Pro Director. For every character
introduced, emit exactly this JSON and nothing else:

{
  "character_id": "kebab-case-slug",
  "physical": {
    "build": "...", "face_shape": "...", "skin": "...",
    "hair": {"length": "...", "color": "...", "style": "..."},
    "eyes": "...", "distinguishing": ["..."]
  },
  "wardrobe": [{
    "scene_tag": "...",
    "outfit": "fabric-specific description",
    "fabric_notes": "thread count, weave, behaviour under motion",
    "accessories": ["..."]
  }],
  "lock_phrase": "60-word paragraph prepended to every still prompt"
}

Rules: once physical is set it never changes. Wardrobe entries may be added
but never retconned. The lock_phrase is the only string that ever ships to
the image model."""


def lock_character(brief: str) -> dict:
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[{
            "type": "text",
            "text": DIRECTOR_SKILL,
            "cache_control": {"type": "ephemeral"},  # cache hit on call 2+
        }],
        messages=[{"role": "user", "content": brief}],
    )
    return json.loads(r.content[0].text)
```
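Whether the Skill is actually being served from cache can be checked from the `usage` block on each response. The sketch below classifies a call from its `cache_creation_input_tokens` and `cache_read_input_tokens` counters. The helper name `cache_status` and the token counts are hypothetical stand-ins. One assumption worth verifying against the current Anthropic documentation: prompts shorter than a model-specific minimum are not cached at all, so a short Skill can report zero on both counters.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify one response's usage block as a cache 'hit', 'write', or 'none'."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # the cached system prompt was reused
    if written > 0:
        return "write"  # first call: the prompt was stored in the cache
    return "none"       # nothing cached, e.g. prompt under the minimum length


# Stand-in usage objects with made-up numbers; in real use pass `r.usage`
# from the response returned by client.messages.create.
first_call = SimpleNamespace(cache_creation_input_tokens=1800, cache_read_input_tokens=0)
second_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=1800)
short_skill = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=0)

for usage in (first_call, second_call, short_skill):
    print(cache_status(usage))
```

If every call reports `none`, the likely fix is a longer Skill (fuller rules and examples) rather than a code change.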
Skill B — Cinema Worldbuilder. Converts shot descriptions into cinematography language. "Nice sunset shot" is rejected. Output uses real-world gear: lens, aperture, lighting setup, sensor, film stock, atmosphere.
```python
WORLDBUILDER_SKILL = """You are the Cinema Worldbuilder. Convert a shot
description into a single dense prompt. You MUST specify:

• lens: focal length + character (35mm anamorphic, 85mm portrait)
• aperture: depth-of-field intent (f/1.4 isolated, f/8 deep)
• lighting: named technique (Rembrandt, split, butterfly, chiaroscuro)
• sensor: film or digital reference (Alexa LF, 16mm Kodak Vision3)
• atmosphere: practical haze, anamorphic flare, lens breathing
• grade: film-stock reference (Kodak 2383, Fuji 3513)

Forbidden words: cinematic, beautiful, nice, stunning, epic.
Treat the diffusion model as a cinematographer who only speaks gear."""


def shot_prompt(shot: str, characters: list[dict]) -> str:
    locks = " ".join(c["lock_phrase"] for c in characters)
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[{"type": "text", "text": WORLDBUILDER_SKILL,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user",
                   "content": f"Characters: {locks}\nShot: {shot}"}],
    )
    return r.content[0].text
```
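A system prompt states the rules but cannot enforce them, so a cheap local lint on the Worldbuilder's output catches violations before a still is rendered. The sketch below is one possible check, not part of the source workflow. The name `lint_shot_prompt` is hypothetical, and the two regexes are rough heuristics for "a focal length is mentioned" and "an f-stop is mentioned".

```python
import re

FORBIDDEN = {"cinematic", "beautiful", "nice", "stunning", "epic"}

# Heuristic patterns only: they confirm a value is present, not that it is sensible.
REQUIRED_HINTS = {
    "lens": r"\b\d{2,3}\s?mm\b",        # e.g. "85mm", "35 mm"
    "aperture": r"\bf/\d+(\.\d+)?\b",   # e.g. "f/1.4", "f/8"
}


def lint_shot_prompt(text: str) -> list[str]:
    """Return a list of rule violations; an empty list means the prompt passes."""
    problems = []
    words = set(re.findall(r"[a-z]+", text.lower()))
    for word in sorted(FORBIDDEN & words):
        problems.append(f"forbidden word: {word}")
    for name, pattern in REQUIRED_HINTS.items():
        if not re.search(pattern, text, flags=re.IGNORECASE):
            problems.append(f"missing {name}")
    return problems


good = "85mm portrait lens at f/1.4, Rembrandt key, Alexa LF, Kodak 2383 grade"
bad = "A nice cinematic sunset over the stage"
print(lint_shot_prompt(good))
print(lint_shot_prompt(bad))
```

A failing prompt can be sent back to the Worldbuilder with the problem list attached, which keeps the correction loop text-only and free of image credits.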
This is the move from prompt engineering to systems engineering. You stop hand-crafting paragraphs per shot and instead define rules once. Every later prompt is generated by a deterministic function of (character lock × shot description). Drift becomes a debugging problem, not an art problem.
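One caveat: a model call is not strictly repeatable, so in practice the "deterministic function" is easier to guarantee by memoising the result per input pair. The sketch below keys generated prompts on a hash of the lock phrases and the shot description. `PromptStore`, `prompt_key`, and the fake generator are hypothetical names for illustration; in real use the generator would wrap a call such as `shot_prompt`.

```python
import hashlib
import json


def prompt_key(lock_phrases: list[str], shot: str) -> str:
    """Stable identifier for one (character locks, shot description) pair."""
    payload = json.dumps({"locks": lock_phrases, "shot": shot.strip()},
                         ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PromptStore:
    """Generate each prompt once, then replay it for identical inputs."""

    def __init__(self, generate):
        self._generate = generate          # callable(lock_phrases, shot) -> str
        self._cache: dict[str, str] = {}
        self.calls = 0                     # how often the generator really ran

    def get(self, lock_phrases: list[str], shot: str) -> str:
        key = prompt_key(lock_phrases, shot)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._generate(lock_phrases, shot)
        return self._cache[key]


# Fake generator standing in for the Worldbuilder call.
store = PromptStore(lambda locks, shot: f"{' '.join(locks)} | {shot}")
first = store.get(["LOCK-A", "LOCK-B"], "wide shot, stage left")
second = store.get(["LOCK-A", "LOCK-B"], "wide shot, stage left")
print(first == second, store.calls)
```

Persisting the cache dictionary to disk would also make re-renders reproducible across sessions.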
Scene-duration planning
Claude · credit budget
Seedance 2.0 charges per clip-second; Higgsfield charges per credit. Asking Claude to emit a per-shot duration plan up front prevents the most common money sink: generating 10-second clips that get cut to 2 seconds in the NLE.
```python
def budget(script: str) -> list[dict]:
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="""Given a script, return JSON list:
[{"shot_id": "s01", "intent": "...", "seconds": float}]
Sum of seconds must equal the script's target runtime.
Round each shot up to the nearest 0.5s. No clip < 2s.""",
        messages=[{"role": "user", "content": script}],
    )
    plan = json.loads(r.content[0].text)
    print(f"Total seconds: {sum(s['seconds'] for s in plan)}")
    return plan
```
The total seconds × per-second engine rate is your hard budget. If it exceeds what you're willing to spend, the right answer is to cut shots, not shorten them — short clips lose more in pacing than long clips lose in cost.
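The budget arithmetic and the "cut shots, don't shorten them" rule can be sketched as a small local check on the plan Claude returns. Everything named here is an assumption for illustration: the `RATE_PER_SECOND` value is a placeholder rather than a real Seedance price, and the `priority` field is a hypothetical addition to the shot schema.

```python
RATE_PER_SECOND = 0.10  # hypothetical engine rate in USD, not a quoted price


def check_plan(plan: list[dict], target_runtime: float) -> list[str]:
    """Validate the rules the planning prompt asks for."""
    problems = []
    for shot in plan:
        if shot["seconds"] < 2:
            problems.append(f"{shot['shot_id']}: shorter than 2s")
        if (shot["seconds"] * 2) % 1 != 0:
            problems.append(f"{shot['shot_id']}: not a multiple of 0.5s")
    total = sum(shot["seconds"] for shot in plan)
    if abs(total - target_runtime) > 1e-9:
        problems.append(f"total {total}s != target {target_runtime}s")
    return problems


def plan_cost(plan: list[dict], rate: float = RATE_PER_SECOND) -> float:
    return sum(shot["seconds"] for shot in plan) * rate


def cut_to_budget(plan: list[dict], max_spend: float,
                  rate: float = RATE_PER_SECOND) -> list[dict]:
    """Drop whole shots, lowest priority first, until the plan fits the budget."""
    kept = list(plan)
    for shot in sorted(plan, key=lambda s: s.get("priority", 0)):
        if plan_cost(kept, rate) <= max_spend:
            break
        kept.remove(shot)
    return kept


plan = [
    {"shot_id": "s01", "seconds": 4.0, "priority": 3},
    {"shot_id": "s02", "seconds": 2.5, "priority": 1},
    {"shot_id": "s03", "seconds": 3.5, "priority": 2},
]
print(check_plan(plan, target_runtime=10.0))
print(round(plan_cost(plan), 2))
print([s["shot_id"] for s in cut_to_budget(plan, max_spend=0.80)])
```

Validating locally matters because the two prompt rules can conflict: rounding every shot up to 0.5s may push the sum past the target runtime, and the check surfaces that instead of silently overspending.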
Static stills + vision feedback loop
Nano Banana Pro · Claude vision · up to 5 characters
Nano Banana Pro (Gemini 3 Pro Image) is the asset-locking workhorse. Its "Consistency by Design" architecture accepts up to 14 reference images of up to 5 distinct people in a single call, making it the only consumer-tier image model that holds multi-character identity reliably. Generate the still, then feed it back to Claude with vision input. Claude's correction list rewrites the next prompt before any motion credit is spent.
```python
from google import genai
from google.genai import types
import base64, pathlib

gem = genai.Client()


def render_still(prompt: str, refs: list[bytes], out: str):
    parts = [types.Part.from_bytes(data=r, mime_type="image/png") for r in refs]
    parts.append(prompt + " 16:9 cinematic")
    r = gem.models.generate_content(
        model="gemini-3-pro-image-preview",
        contents=parts,
        config=types.GenerateContentConfig(response_modalities=["IMAGE"]),
    )
    pathlib.Path(out).parent.mkdir(parents=True, exist_ok=True)
    for p in r.candidates[0].content.parts:
        if p.inline_data:
            pathlib.Path(out).write_bytes(p.inline_data.data)


def verify(image_path: str, intent: str) -> dict:
    img = base64.standard_b64encode(pathlib.Path(image_path).read_bytes()).decode()
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png", "data": img}},
            {"type": "text", "text": f"""Director intent: {intent}

Compare image to intent. Return JSON only:
{{
  "matches_intent": bool,
  "drift": ["..."],
  "fix_prompt": "..."
}}
Look for: wrong outfit, wrong hair, wrong number of subjects,
wrong lens character, wrong lighting direction."""}
        ]}],
    )
    return json.loads(r.content[0].text)  # dict with the three keys above


# Loop until intent matches.
# `prompt`, `refs`, and `intent` come from the orchestration stage.
for attempt in range(3):
    render_still(prompt, refs, "out/s01.png")
    check = verify("out/s01.png", intent)
    if check["matches_intent"]:
        break
    prompt = check["fix_prompt"]
```
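The loop above parses the model reply with a bare `json.loads`, which fails if the reply arrives wrapped in a Markdown code fence or with a sentence of preamble. A tolerant parser is a small safeguard. The helper below is a sketch under that assumption; `extract_json` is a hypothetical name, not part of any SDK.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, flags=re.DOTALL)


def extract_json(reply: str):
    """Parse the first JSON object or array found in a model reply."""
    match = _FENCED.search(reply)
    candidate = match.group(1) if match else reply
    starts = [i for i in (candidate.find("{"), candidate.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON object or array in reply")
    # raw_decode stops at the end of the first valid value, ignoring trailing prose.
    value, _ = json.JSONDecoder().raw_decode(candidate[min(starts):])
    return value


fenced_reply = (
    "Here is the report:\n" + FENCE + "json\n"
    '{"matches_intent": false, "drift": ["wrong hair"], "fix_prompt": "short black bob"}\n'
    + FENCE
)
plain_reply = '{"matches_intent": true, "drift": [], "fix_prompt": ""} Hope that helps.'

print(extract_json(fenced_reply)["drift"])
print(extract_json(plain_reply)["matches_intent"])
```

Swapping `json.loads(r.content[0].text)` for `extract_json(r.content[0].text)` keeps one malformed reply from aborting a three-attempt loop.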
Nano Banana (Gemini 2.5 Flash Image) is fast and casual — fine for the basic single-character pipeline. Nano Banana Pro (Gemini 3 Pro Image) does spatial reasoning before pixel generation. It actually renders legible long-form text, holds 5 identities, supports radical lighting changes without warping geometry, and obeys aspect-ratio retargeting (expanding background, locking subject coordinates). If any of those matter, pay the Pro tier.
Motion synthesis: two engines, one router
Seedance 2.0 · Higgsfield · route by constraint
Different shots have different hardest constraints. The pipeline routes each one to whichever engine handles its constraint best, rather than picking a single engine and hoping.
| Shot type | Hardest constraint | Route to |
|---|---|---|
| Synchronised choreography | Physics + multi-body coherence | Seedance 2.0 |
| Singing close-up | Lip-sync to audio reference | Seedance 2.0 |
| Camera move + VFX flares | Cinematic motion control | Higgsfield |
| Hallucination patch | Localised inpainting in moving footage | Higgsfield Canvas |
| Long take > 10s | Continuation without seam | Seedance "continue" |
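The routing table translates directly into a small dispatch function. The sketch below is one way to encode it. The `Shot` fields, the engine labels, and especially the precedence order (patches first, then long takes, then physics and lip-sync, then camera VFX) are assumptions for illustration; the source only gives the table, not a tie-breaking rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    shot_id: str
    seconds: float
    choreography: bool = False   # synchronised multi-body movement
    lip_sync: bool = False       # singing close-up driven by an audio reference
    camera_vfx: bool = False     # camera move with VFX flares
    needs_patch: bool = False    # existing clip with a localised hallucination


def route(shot: Shot) -> str:
    """Pick an engine label by the shot's hardest constraint (assumed precedence)."""
    if shot.needs_patch:
        return "higgsfield-canvas"
    if shot.seconds > 10:
        return "seedance-continue"
    if shot.choreography or shot.lip_sync:
        return "seedance-2.0"
    if shot.camera_vfx:
        return "higgsfield"
    return "seedance-2.0"  # default when no special constraint is flagged


shots = [
    Shot("s01", 5, choreography=True),
    Shot("s02", 12),
    Shot("s03", 4, camera_vfx=True),
    Shot("s04", 4, needs_patch=True),
]
for shot in shots:
    print(shot.shot_id, route(shot))
```

Keeping the rules in one function means a routing change is a one-line edit rather than a per-shot decision.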
Seedance 2.0 ingests up to 9 images + 3 video clips + 3 audio references in one call. That audio input is the key — for a music video you pass the song stem and the model generates lip-sync directly from the waveform rather than from a separate sync pass.
```python
import replicate


def seedance(still_path: str, motion_prompt: str, audio_path: str | None):
    inputs = {
        "image": open(still_path, "rb"),
        "prompt": motion_prompt,
        "duration": 5,
        "fps": 24,
    }
    if audio_path:  # lip-sync from waveform
        inputs["audio"] = open(audio_path, "rb")
    return replicate.run("bytedance/seedance-2.0", input=inputs)
```
Higgsfield is the cinematic-control layer that aggregates Sora 2, Veo 3, and its own proprietary models behind a single UI. Two features justify reaching for it. Canvas masks and regenerates a localised region across moving frames, which is the only way to fix a single-second hallucination without scrapping the whole clip. Soul binds video generation to the Nano Banana Pro reference sheet so faces don't drift. Higgsfield is currently GUI-driven for most users; the programmatic path is to render the clean shots through Seedance and route only the broken ones to Higgsfield Canvas manually.
If you call Replicate via raw HTTP, the Seedance 2.0 model-slug endpoint rejects a {"version": ...} field with HTTP 422. Use POST /v1/models/bytedance/seedance-2.0/predictions and drop the version field entirely. The Python SDK above handles this for you.
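For the raw-HTTP case, the request shape described above looks roughly like the sketch below, which builds the request with the standard library and stops short of sending it. The token value and input fields are placeholders, and `build_prediction_request` is a hypothetical helper. The host and the Bearer-token header follow Replicate's public HTTP API.

```python
import json
import urllib.request

API_URL = "https://api.replicate.com/v1/models/bytedance/seedance-2.0/predictions"


def build_prediction_request(token: str, inputs: dict) -> urllib.request.Request:
    """Build a POST to the model-slug endpoint; the body carries only "input"."""
    body = json.dumps({"input": inputs}).encode("utf-8")  # deliberately no "version" key
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_prediction_request(
    "r8_placeholder_token",  # placeholder, not a real credential
    {"prompt": "dancers turn in unison", "duration": 5, "fps": 24},
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```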
Assembly: CapCut, or scripted ffmpeg
CapCut PC · ffmpeg (scripted)
The source workflow uses CapCut PC's two-timeline feature for primary footage + B-roll layering, auto-captions, motion tracking, and "MagnatesMedia documentary"–style animated curve lines and 3D map overlays. That's a taste pass, so opening CapCut and editing on the timeline is fine here.
If you want the assembly step to stay scripted (CI-friendly, re-runnable), ffmpeg does the structural cut and lets you keep CapCut for the final 30-second polish. The patterns the basic pipeline uses (concat demuxer + drawtext + amix) still apply, plus two additions for the multi-engine case:
```python
import subprocess

# `clips` is the ordered list of clip paths produced by the motion stage.

# 1. Mixed-source xfade: different engines produce different colour spaces.
#    Force every clip through a normalisation filter BEFORE concat.
for i, clip in enumerate(clips):
    subprocess.run([
        "ffmpeg", "-y", "-i", clip,
        "-vf", "scale=3840:2160:flags=lanczos,format=yuv420p,setsar=1",
        "-c:v", "libx264", "-crf", "18", "-preset", "medium",
        f"out/norm_{i:02d}.mp4",
    ], check=True)

# 2. Lip-sync proof: check Seedance's sync drift against the source song.
#    silencedetect logs silence boundaries to stderr; compare those timestamps
#    with the same measurement on the source song. A simple cross-correlation
#    of the two audio tracks also suffices.
subprocess.run([
    "ffmpeg", "-i", "out/norm_00.mp4",
    "-af", "silencedetect=noise=-30dB:d=0.3",
    "-f", "null", "-",
], check=True)
```
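The cross-correlation mentioned for the lip-sync check can be sketched in a few lines. The version below assumes both audio tracks have already been decoded and reduced to short amplitude envelopes sampled at the same rate (that decoding step is not shown). `estimate_lag` is a hypothetical helper and the envelopes are synthetic. For long signals a vectorised routine such as `numpy.correlate` would be the practical choice.

```python
def estimate_lag(reference: list[float], candidate: list[float], max_lag: int) -> int:
    """Return the shift, in samples, that best aligns candidate with reference.

    A positive result means the candidate track is late relative to the reference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(candidate):
                score += ref_value * candidate[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic envelopes: the clip audio is the song delayed by two samples.
song = [0, 0, 1, 3, 1, 0, 0, 2, 5, 2, 0, 0]
clip = [0, 0] + song[:-2]

ENVELOPE_RATE_HZ = 100  # assumed envelope sample rate
lag = estimate_lag(song, clip, max_lag=5)
print(lag, "samples =", lag * 1000 / ENVELOPE_RATE_HZ, "ms late")
```

A threshold on the resulting offset (the exact tolerance is a judgment call) turns this into a pass/fail gate before assembly.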
(1) xfade renegotiates back to yuv444p — append ,format=yuv420p to the last filter and force -pix_fmt yuv420p on encode.
(2) drawtext can't parse Unicode escapes — write captions to a .txt file and use textfile=.
(3) For Thai glyphs use tahoma.ttf not tahomabd.ttf (Bold has no Thai code points).
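Gotcha (1) is easiest to avoid by generating the xfade filtergraph rather than typing it by hand. The sketch below builds a `-filter_complex` string for any number of clips, computes each fade's offset on the output timeline, and appends `format=yuv420p` to the last filter. `xfade_filter` is a hypothetical helper, and the clip durations are assumed to be known in advance (for example from the shot plan).

```python
def xfade_filter(durations: list[float], fade: float = 0.5) -> str:
    """Build a -filter_complex string that cross-fades N clips in order."""
    if len(durations) < 2:
        raise ValueError("need at least two clips")
    steps, prev, elapsed = [], "[0:v]", 0.0
    for i in range(1, len(durations)):
        # Each fade starts `fade` seconds before the running output ends.
        elapsed += durations[i - 1] - fade
        last = i == len(durations) - 1
        label = "[vout]" if last else f"[v{i}]"
        tail = ",format=yuv420p" if last else ""  # pin the pixel format at the end
        steps.append(
            f"{prev}[{i}:v]xfade=transition=fade:duration={fade}"
            f":offset={elapsed:g}{tail}{label}"
        )
        prev = label
    return ";".join(steps)


graph = xfade_filter([5, 5, 5])
print(graph)
```

The string goes to ffmpeg as `-filter_complex <graph> -map "[vout]" -pix_fmt yuv420p`, with one `-i` per normalised clip in the same order as `durations`.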
Upscale + master
Aiarty · VideoProc · Real-ESRGAN (free)
Generative video frequently outputs sub-4K with mild compression artefacts. The source workflow uses Aiarty Video Enhancer for AI denoising and upscaling, then VideoProc Converter AI for H.265 encoding. Both are paid GUI tools.
The scriptable equivalent without the paid GUI tools: run Real-ESRGAN on the extracted frames, either locally on a GPU or through Replicate's nightmareai/real-esrgan model, and let ffmpeg handle decoding and the H.265 encode. Quality is comparable for the 1080p→4K case; the paid tools are faster, not better.
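```python
import pathlib, subprocess
import replicate

# Scripted path: Real-ESRGAN via Replicate, then H.265 via ffmpeg.
# Assumption: nightmareai/real-esrgan accepts still images only, so the video
# is split into frames, upscaled frame by frame, and reassembled.
frames = pathlib.Path("out/frames")
upscaled = pathlib.Path("out/frames_4k")
frames.mkdir(parents=True, exist_ok=True)
upscaled.mkdir(parents=True, exist_ok=True)

subprocess.run(["ffmpeg", "-y", "-i", "out/final_1080.mp4",
                str(frames / "%06d.png")], check=True)

for frame in sorted(frames.glob("*.png")):
    with open(frame, "rb") as fh:
        out = replicate.run("nightmareai/real-esrgan",
                            input={"image": fh, "scale": 2})
    (upscaled / frame.name).write_bytes(out.read())

subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "24", "-i", str(upscaled / "%06d.png"),  # 24 fps, as generated
    "-i", "out/final_1080.mp4",                            # reuse the original audio
    "-map", "0:v", "-map", "1:a?",
    "-c:v", "libx265", "-crf", "20", "-preset", "medium",
    "-pix_fmt", "yuv420p",
    "-tag:v", "hvc1",  # Safari/iOS playback
    "-c:a", "aac", "-b:a", "192k",
    "out/final_4k.mp4",
], check=True)
```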
The aggregator landscape
If running six separate API keys is more friction than it's worth, the market has responded with workflow aggregators — single subscriptions that route between Gemini, Veo, Seedance, Kling, Sora, HeyGen, ElevenLabs, etc. behind one UI. You give up some control; you gain not juggling browser tabs.
| Platform | What's bundled | Distinguisher |
|---|---|---|
| Vimerse Studio | Gemini, ElevenLabs, Flux, Imagen, Veo, Kling, Seedance | End-to-end script → voice → video |
| Tagshop AI | Nano Banana 2/Pro, Seedance, Sora 2, Kling 3, Wan 2.6, Hailuo | Highest-end model breadth |
| Vadoo AI | Veo, Kling, Hailuo, Sora, Runway | Built-in lip-sync studio + extension |
| Zeemo | Multi-model image + video | Coherent multi-shot export from one prompt |
| Baz V4 | Veo 3.1, Seedance, HeyGen, Minimax TTS, Nano Banana | Chat-driven agentic timeline editor |
| Vizard AI | Seedance, Kling, multi-model | Auto-detect highlights, brand kits, B-roll |
The strategic read: foundation models are commoditising into API endpoints. The competitive layer is moving up to orchestration. Pick an aggregator when you'd otherwise spend more time on glue than on the work; stay raw-API when the glue is the work.
Physical capture that grounds the synthetic
Even an end-to-end synthetic music video benefits from a small amount of real capture. AI sound generators are still weak at sharp foley impacts; AI faces still composite better against real backplates than fully-generated ones.
- Tascam DR-10L — pocket-size lav recorder. The source workflow used it for foley: real impact sounds for "what an alien sounds like when hit." Anything percussive or close-mic'd that AI struggles to invent.
- Panasonic Lumix GH5 + 12–35mm — 10-bit 4K micro-four-thirds. Cheap, clean, and the dynamic range survives composite alongside Seedance output without aesthetic clash.
The pattern: the AI generates the impossible (synchronised dance, multiple identities, custom worlds); physical hardware grounds the parts where reality is cheaper than synthesis.
What it actually costs
The headline 1/500th-cost figure from Genvid's The Seeker only holds against million-dollar animated baselines. For a creator coming from the basic pipeline, the comparison runs the other way: this build costs roughly 10× more per video, ~$25 for a 3-minute multi-character build vs. ~$2.50 for a 2-minute single-character build. The extra spend buys character continuity across shots and lip-sync that survives close-ups.
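Because the two builds differ in runtime, the per-video ratio slightly overstates the gap. A quick check using only the figures quoted above normalises the cost per finished minute:

```python
basic = {"cost_usd": 2.50, "minutes": 2}      # single-character build
advanced = {"cost_usd": 25.00, "minutes": 3}  # multi-character build


def per_minute(build: dict) -> float:
    return build["cost_usd"] / build["minutes"]


ratio_per_video = advanced["cost_usd"] / basic["cost_usd"]
ratio_per_minute = per_minute(advanced) / per_minute(basic)

print(f"per video:  {ratio_per_video:.1f}x")
print(f"per minute: {ratio_per_minute:.1f}x "
      f"(${per_minute(advanced):.2f} vs ${per_minute(basic):.2f})")
```

Per finished minute the multi-character build costs about 6.7× the basic one ($8.33 vs. $1.25), against 10× per video.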
The structural shift
The human role in this pipeline isn't "writer of clever prompts." It's "designer of the system that writes prompts." You write two long Skills once, define the routing rules between engines once, and the day-to-day work becomes specifying intent and reviewing output — not crafting paragraphs by hand.
Hollywood-scale productions keep their physical crews and bespoke VFX. Everything below — YouTube documentaries, corporate training, ads, indie film, music videos — drops to a few API subscriptions and a competent operator. The Genvid 1/500th number is the asymptote, not the average; the practical ratio for mid-tier work is closer to 10–50× cheaper than the previous baseline.
Foundation models commoditise into APIs. Aggregators commoditise into UIs. The durable value sits in two places: (1) the taste required to direct the system, and (2) the engineering required to integrate it into something repeatable. Neither is a model output.