From Script to Edited Video Overnight: The 4-Tool Stack for Solo Content Production

Portrait of Viktoriia Didur, founder of ViMaxus agency.

01.06.2026 10 min read

Four tools, working together, replace what used to take a video team. HeyGen Avatar 5 generates a photorealistic talking head from 15 seconds of webcam footage. ElevenLabs clones your voice from 30 minutes of audio. Remotion renders motion graphics in code. Claude Code orchestrates all three from a single command. A five-hour pipeline becomes an overnight job. Total stack cost: about 72 USD per month plus per-minute API fees.

The three shifts that made this possible

Video content used to need a team. A person to read the script. A camera operator. An audio engineer. A video editor. As of 2026, the same output comes from a four-tool stack that one person operates from a keyboard.

Three shifts converged to make this work. None of them alone is enough.

Avatar models crossed the uncanny valley. HeyGen Avatar 5 is trained on more than 10 million facial expression data points and creates a digital twin from 15 seconds of webcam footage.
AI can orchestrate the full pipeline. Claude Code connects HeyGen, ElevenLabs and Remotion as a single multi-step workflow, with the human only writing the script.
The bottleneck moved. Production and post-production used to be the slow part. Now the slow part is what should be said. The human stays in the loop where it matters most: the ideas.

Layer 1: HeyGen Avatar 5. The talking head

HeyGen Avatar 5 is the visual layer. The model captures how a specific person gestures, blinks, swallows, glances around. The result is a clip that looks recorded, not animated.

HeyGen pricing (verified 2026-05-31)

Free: 3 videos per month, up to 1 minute each.
Creator: 25 EUR per month, 600 credits, videos up to 30 minutes.
Pro: 42 EUR per month, 1,000 credits, 4K export.
Business: 128 EUR per month, 1,500 credits, videos up to 60 minutes.
Credit rates: Avatar 3 costs 3 credits per minute. Avatar 4 and Avatar 5 cost 20 credits per minute.
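To see what those credit numbers mean in minutes of video, here is a minimal sketch. It assumes credits are spent only on avatar minutes and that the per-minute rates scale linearly, which the pricing page may not guarantee. The dictionary names are illustrative.

```python
# Plan credits and per-minute rates as quoted in the pricing above.
PLAN_CREDITS = {"Creator": 600, "Pro": 1000, "Business": 1500}
CREDITS_PER_MINUTE = {"Avatar 3": 3, "Avatar 4": 20, "Avatar 5": 20}


def minutes_per_month(plan: str, avatar: str) -> float:
    """Minutes of video one month of plan credits covers for one avatar model."""
    return PLAN_CREDITS[plan] / CREDITS_PER_MINUTE[avatar]


for plan in PLAN_CREDITS:
    for avatar in CREDITS_PER_MINUTE:
        minutes = minutes_per_month(plan, avatar)
        print(f"{plan:8s} {avatar}: {minutes:.0f} min/month")
```

On these numbers the Creator plan covers about 30 minutes of Avatar 4 or Avatar 5 output per month, so three 10-minute videos use up the allowance.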

The catch: Avatar 5 is capped at 3 minutes per generation. Scripts longer than that have to be chunked. And as of mid-2026, Avatar 5 is not yet exposed via the HeyGen API. Only Avatar 3 and 4 are. Production pipelines work around this with a Playwright step that opens the HeyGen dashboard and re-renders each clip with Avatar 5 selected.

Build the avatar once. Either record a 15-second clip (fast, good enough for most uses) or upload up to 10 GB of footage to train a deeper model (slower, near-indistinguishable from the real person).

Layer 2: ElevenLabs. The voice clone

HeyGen ships with an auto-cloned voice. It is not good enough. The lip sync is fine, the timbre is not. Replace it immediately with an ElevenLabs Professional Voice Clone imported into HeyGen as a third-party voice.

ElevenLabs pricing (verified 2026-05-31)

Free: 10,000 credits per month (about 10 minutes).
Starter: instant voice cloning.
Creator: 11 USD per month (22 USD first month), 121,000 credits (about 121 minutes), Professional Voice Cloning included.
Pro: 600,000 credits per month.
Scale, Business: enterprise tiers.

Practice rule: feed the Professional Voice Clone at least 30 minutes of clean audio. Two hours is better. Quality compounds with sample size. After cloning, you can tune speed, stability, similarity and style exaggeration per generation.

ElevenLabs degrades on long generations. Past about one minute of continuous audio, the voice drifts. The sweet spot is 45 to 60 seconds per chunk. Cut at sentence boundaries, never mid-sentence. The stitched final video will give itself away otherwise.
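The chunking rule can be sketched in a few lines. This is a rough illustration, not a production splitter. The speaking rate of 150 words per minute is an assumption, so measure your own clone. The sentence split is naive and will trip on abbreviations such as "e.g.". The function names are hypothetical.

```python
import re

WORDS_PER_MINUTE = 150  # assumed speaking rate; measure your own clone
MAX_SECONDS = 60        # upper end of the 45 to 60 second window


def estimate_seconds(text: str) -> float:
    """Rough spoken duration of a text at the assumed speaking rate."""
    return len(text.split()) / WORDS_PER_MINUTE * 60


def chunk_script(script: str, max_seconds: float = MAX_SECONDS) -> list[str]:
    """Group whole sentences into chunks that stay under max_seconds."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and estimate_seconds(candidate) > max_seconds:
            # Adding this sentence would overshoot: close the chunk first.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A sentence that is longer than the limit on its own still becomes a single chunk. The sketch never cuts mid-sentence, so a long sentence has to be fixed in the script.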

Layer 3: Remotion. Motion graphics in code

Remotion is a React framework for video. You write components in TypeScript, you get a rendered MP4. For an AI orchestration pipeline, that is exactly the right shape: the AI writes the component, the renderer produces the file.

What Remotion adds to the stack is the motion graphics layer that distinguishes finished content from a person talking to a webcam. Lower-thirds, animated transitions, key-point callouts, brand overlays. All of it timed to the transcript, all of it driven by data.

Standard pattern: transcribe the chunked avatar clips. Pass timestamps and transcript text into a Remotion composition. The composition triggers animations at the right second. The result is a single MP4 with synced motion graphics, ready to publish.
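Remotion thinks in frames, not seconds, so the timestamps need converting before a composition can use them. A minimal sketch of that step follows. The segment format (start, end, text) is a hypothetical transcription output, and 30 fps is an assumed frame rate. The output keys are named after the from and durationInFrames props of Remotion's Sequence component, so the list could be passed to a composition as input data.

```python
import json

FPS = 30  # assumed composition frame rate


def to_cues(segments: list[dict], fps: int = FPS) -> list[dict]:
    """Convert transcript segments timed in seconds into frame-based cues."""
    cues = []
    for seg in segments:
        start = round(seg["start"] * fps)
        end = round(seg["end"] * fps)
        cues.append({
            "text": seg["text"],
            "from": start,
            # Keep every cue visible for at least one frame.
            "durationInFrames": max(1, end - start),
        })
    return cues


# Hypothetical transcript segments for illustration.
segments = [
    {"start": 0.0, "end": 2.5, "text": "Four tools replace a video team."},
    {"start": 2.5, "end": 6.0, "text": "The bottleneck moved to ideas."},
]
print(json.dumps(to_cues(segments), indent=2))
```

Keeping the cue list as plain data means the timing can be regenerated from a new transcript without touching the component.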

Layer 4: Claude Code. The orchestrator

The three tools above only matter if they work as a single chain. Claude Code is the layer that holds that chain together. It reads the script. It chunks it at sentence boundaries to keep each chunk under one minute. It sends each chunk to ElevenLabs for audio. It feeds the audio to HeyGen for the avatar video. It downloads each clip, stitches with FFmpeg, hands off to Remotion for motion graphics. It then writes the final file to disk.
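The stitching leg is the easiest to show concretely. The sketch below builds an FFmpeg concat list and the matching command, assuming every clip shares codec, resolution and frame rate so the streams can be copied without re-encoding. File names are hypothetical, and paths containing a single quote would need escaping.

```python
from pathlib import Path


def concat_list(clips: list[str]) -> str:
    """Body of an FFmpeg concat-demuxer list file, one clip per line."""
    return "".join(f"file '{Path(c).as_posix()}'\n" for c in clips)


def concat_command(list_file: str, output: str) -> list[str]:
    """FFmpeg invocation that joins the listed clips in order."""
    # -f concat reads the list file; -safe 0 allows arbitrary paths;
    # -c copy joins the streams without re-encoding.
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]


# Hypothetical clip names for one lesson.
clips = [f"clips/lesson_5_0_part{i:02d}.mp4" for i in range(1, 4)]
print(concat_list(clips), end="")
print(" ".join(concat_command("clips.txt", "lesson_5_0.mp4")))
```

In a real run the list is written to disk and the command is passed to subprocess.run with check=True, so a failed stitch stops the pipeline instead of handing a broken file to Remotion.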

What this looks like in practice

Drop five 10-minute lesson scripts into a folder. Tell Claude Code: process lessons 5.0 through 5.4. Go to sleep. Wake up to five finished videos. Total human time during the run: zero.

The Playwright workaround for the Avatar 5 API limitation

One sharp edge in this stack is worth naming. The HeyGen API supports Avatar 3 and Avatar 4. Avatar 5 is dashboard-only as of mid-2026. The fix in production pipelines is a headless Playwright browser step that opens the HeyGen dashboard, finds each clip generated via the API with Avatar 4, clicks New Revision in AI Studio, switches the avatar to Avatar 5, and re-generates.

It is a hack. It will become unnecessary the moment HeyGen exposes Avatar 5 via API. Until then, treat it as a temporary bridge and budget the extra wall-clock time per video for the Playwright leg of the pipeline.

What the full stack costs

Three subscription line items and one variable API line item:

HeyGen Creator: 25 EUR per month (subscription credits; not enough to cover API use at scale)
ElevenLabs Creator: 11 USD per month (regular price; 22 USD first month)
Claude Code: 20 to 200 USD per month depending on tier
HeyGen API: about 4 USD per 1-minute clip at Avatar 4 (so a 10-minute video runs roughly 40 USD in API fees)

Compare to the alternative. A freelance video editor costs 35 to 75 USD per hour. A 10-minute YouTube video might take 4 to 6 hours of editing, so 140 to 450 USD per video. Add a voiceover artist or a recording studio and the per-video cost climbs past 500 USD. Even at 50 USD of API fees per video, the AI stack runs roughly 3 to 10 times cheaper, and it gives you the hours back.
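The comparison is simple enough to check with a few lines of arithmetic. The figures are the ones quoted above. The 500 USD line stands in for the voiceover or studio scenario as a round floor, since the text only says the cost climbs past it.

```python
EDITOR_RATE_USD = (35, 75)   # freelance editor, USD per hour (low, high)
EDITING_HOURS = (4, 6)       # hours per 10-minute video (low, high)
AI_COST_PER_VIDEO = 50       # USD of API fees per video

editor_low = EDITOR_RATE_USD[0] * EDITING_HOURS[0]
editor_high = EDITOR_RATE_USD[1] * EDITING_HOURS[1]
with_voiceover = 500         # floor for the voiceover or studio scenario

scenarios = [
    ("editor, low end", editor_low),
    ("editor, high end", editor_high),
    ("editor plus voiceover or studio", with_voiceover),
]
for label, cost in scenarios:
    ratio = cost / AI_COST_PER_VIDEO
    print(f"{label}: {cost} USD, {ratio:.1f}x the AI stack")
```

The multiple depends heavily on which end of the editor range you compare against. Against the cheapest editor it is just under 3x. Against an editor plus a voiceover artist or studio it reaches 10x.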

Three objections and the honest answers

Is this fake or inauthentic?

Partly fair. The script is yours, the voice is yours, the face is yours. Only the chair-time is missing. The right test is whether the content is your real thinking. If you wrote the script, the video is yours. If you fed a generic prompt to ChatGPT and ran the output through an avatar, the audience will know within two clips.

Will this flood the internet with AI slop?

AI writing tools already exist. The flood is happening regardless of what video stack people use. What changes is that quality is now the actual filter. Bad content with a polished avatar is still bad content. The bottleneck moved from production to ideas, which is the bottleneck that was always supposed to be there.

Does this kill video editor jobs?

It changes the job. Some specific tasks (cut, trim, sync) get automated. The editors who survive are the ones who apply subject-matter expertise on top: pacing, narrative structure, the editorial judgment about what to cut. Generic editing-as-a-service compresses. Editorial direction expands.

Where to start

Record a 15-second webcam clip and create your HeyGen Avatar 5 clone. Free tier is enough to test.
Upload at least 30 minutes of your voice into ElevenLabs and create a Professional Voice Clone. Import it into HeyGen as a third-party voice.
Generate one 45-second test clip from a real script you would otherwise read aloud yourself. Compare to the auto-cloned voice. The difference is the value of layer 2.
Only after the first end-to-end test, start building the Claude Code orchestration. Premature orchestration wastes weeks. First prove the unit economics work for your specific voice and use case.

The pipeline gives back roughly 10 hours per week once dialed in, or about 43 hours per month. At a stack cost of about 250 USD per month, that is roughly 6 USD per hour to buy your own time back. If your hour is worth more than 6 USD, it is hard to justify not doing this.
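To rerun that calculation with your own numbers, a small sketch is enough. The only assumption beyond the figures above is averaging a month as 52 / 12 weeks. The function name is illustrative.

```python
WEEKS_PER_MONTH = 52 / 12  # average weeks in a month


def cost_per_hour_saved(monthly_cost_usd: float, hours_saved_per_week: float) -> float:
    """Monthly stack cost divided by the hours it gives back each month."""
    return monthly_cost_usd / (hours_saved_per_week * WEEKS_PER_MONTH)


# Figures from the paragraph above: 250 USD per month, 10 hours per week.
print(round(cost_per_hour_saved(250, 10), 2))
```

Halving the hours saved doubles the hourly price, so measure the hours honestly before trusting the result.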

Frequently Asked Questions

Can I generate Avatar 5 videos via the HeyGen API?

Not as of mid-2026. The HeyGen API supports Avatar 3 and Avatar 4. Avatar 5 is dashboard-only. Production pipelines either accept Avatar 4 output or add a Playwright step that re-renders each clip in the dashboard. HeyGen has signaled that Avatar 5 API support is on the roadmap, but there is no public date as of writing.

Why does ElevenLabs sound worse inside HeyGen than in ElevenLabs directly?

HeyGen re-encodes audio when it ingests a third-party voice clip. The compression introduces artifacts that flatten the timbre. The workaround is to generate the audio in ElevenLabs as cleanly as possible (44.1 kHz, mono, no extra processing), then upload directly to HeyGen as a separate audio file under AI Studio, paired with the avatar. This avoids the in-platform voice mismatch.
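A quick pre-upload check catches files that drifted from the target format. This sketch assumes the audio was exported as an uncompressed WAV file, which is an assumption about your export settings. An MP3 export would need a different reader. The function name is hypothetical.

```python
import wave


def check_wav(path: str, rate: int = 44100, channels: int = 1) -> list[str]:
    """Return mismatches against the target format (empty list if fine)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate {wav.getframerate()} Hz, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
    return problems
```

Call check_wav on each chunk before upload and stop the run if the list is not empty. That is cheaper than discovering a flattened voice after the avatar has rendered.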

What is the right length for an ElevenLabs voice generation?

Between 45 and 60 seconds. The model degrades past one minute of continuous audio: the voice starts to sound less like the cloned speaker. Chunk every script at sentence boundaries to stay in this window, then stitch in post.

Is 30 minutes of voice data enough for a Professional Voice Clone?

Enough for a usable clone. Not enough for an indistinguishable one. ElevenLabs documentation recommends 3 hours of clean audio for the best result. The model improves materially up to about 2 hours and then plateaus. Below 30 minutes the clone is recognisable as you but does not feel quite right.

How much does the full stack actually cost per finished video?

For a typical 10-minute video: HeyGen Creator subscription covers a few minutes at Avatar 5 dashboard rate. Beyond that, HeyGen API at Avatar 4 costs about 4 USD per minute, so 40 USD in API fees. ElevenLabs cost is negligible inside the Creator plan for normal volumes. Claude Code costs depend on tier. All in, expect 50 to 80 USD per finished 10-minute video, before any developer setup time.

What is the legal status of using an AI clone of your own face and voice?

For your own likeness, it is yours to use. The grey area is consent: never clone someone else’s face or voice without explicit written consent. HeyGen requires identity verification before training a clone. ElevenLabs requires a consent recording. Both platforms publish terms of service that prohibit cloning third parties without permission.

Does this replace YouTube video production entirely?

It can. Most creators choose not to. The pattern that has emerged is to keep flagship long-form content human-recorded for authenticity, and use the AI stack for short-form, course material, advertisements, and any high-volume content where production cost was the limiting factor. The audience cares about authentic thinking; they care less about whether you sat in front of a camera to deliver it.

The bottleneck moved from production to ideas.

The four-tool stack handles production. The strategy and the script stay yours. That is the right split.

Vimaxus

We help solo founders and small teams design AI content pipelines that survive past the first novelty video. From HeyGen avatar setup to Remotion templates to Claude Code orchestration, we ship pipelines that produce real output every week.

Written by Viktoriia Didur and Elis