V2 Complete · Developer Tool · Python CLI

Talking-Head Footage
→ Viral 9:16
Automatically.

A Python CLI that converts raw video into optimized vertical clips — face tracking, punch zoom, word-level captions, and loudnorm audio, all from a single command.


$ clip-trimmer input.mp4 --output clip.mp4 --verbose

✓ Audio extracted — 00:04:12

✓ Speech segments detected — 94% speech ratio

✓ Whisper transcript — 847 words, word-level timestamps

✓ Face detected — tracking 1 subject

✓ Crop plan — 9:16 face-centred, 3 punch zones

✓ Captions built — 214 groups, active-word highlight

✓ Segments rendered — 14 segments, loudnorm pass

✓ Done — clip.mp4 (00:03:58) in 47.2s

TDD · Test-driven dev

V2 · Current version

9:16 · Output format

1 cmd · To run

The Pipeline

One Command.
Six Stages.

Every step is deterministic, testable in isolation, and configurable via CLI flags.
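As a rough illustration of the flag surface, here is a minimal argparse sketch. It uses only the flags named on this page (--output, --verbose, --silence-gap); the defaults, the help text, and the assumption that --silence-gap is given in seconds are hypothetical and may not match the real CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags mentioned on this page."""
    parser = argparse.ArgumentParser(prog="clip-trimmer")
    parser.add_argument("input", help="path to the source video")
    parser.add_argument("--output", default="clip.mp4",
                        help="path for the rendered clip (default is assumed)")
    parser.add_argument("--verbose", action="store_true",
                        help="print per-stage progress")
    # Assumption: the gap is expressed in seconds; the real unit may differ.
    parser.add_argument("--silence-gap", type=float, default=0.5,
                        help="minimum silent gap to cut (assumed seconds)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["input.mp4", "--output", "clip.mp4", "--verbose", "--silence-gap", "0.8"]
)
print(args.input, args.output, args.verbose, args.silence_gap)
```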

01 / AUDIO

Extract & Detect Speech

WebRTC VAD strips silence. Only real speech regions go forward.

02 / TRANSCRIPT

Word-Level Timestamps

OpenAI Whisper produces per-word timestamps; the model is cached for reuse.

03 / FACE

Face Tracking

Detects and tracks the main subject for a face-centred 9:16 crop.

04 / EDIT

Crop & Punch Zoom Plan

Per-segment scale→crop pipeline with slow-zoom for engagement.

05 / CAPTIONS

ASS Caption Burn

3–5 word groups with active-word highlight — burned into the output.

06 / RENDER

FFmpeg Render & Loudnorm

Segments are rendered and concatenated. Audio is normalised to the EBU R128 broadcast loudness standard.
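For stage 02, a sketch of how word-level timestamps might be pulled out of a Whisper result. With the openai-whisper package, transcribing with word_timestamps=True yields segments that each carry a "words" list of word/start/end entries. The helper name flatten_words and the output keys are hypothetical, and the sample dict below is hand-written, not real transcript data.

```python
from typing import Any, Dict, List


def flatten_words(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect every word from a Whisper-style result into one flat list."""
    words: List[Dict[str, Any]] = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            text = w["word"].strip()  # Whisper words usually carry a leading space
            if text:
                words.append(
                    {"text": text, "start": float(w["start"]), "end": float(w["end"])}
                )
    return words


# Hand-written stand-in for a transcription result (illustrative only).
sample = {
    "segments": [
        {"words": [
            {"word": " Hello", "start": 0.0, "end": 0.4},
            {"word": " world", "start": 0.4, "end": 0.9},
        ]},
        {"words": [{"word": " again", "start": 1.2, "end": 1.6}]},
    ]
}
words = flatten_words(sample)
print([w["text"] for w in words])
```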

What's In V2

Built for Reliability.

Face-Centred Crop

Automatically centres the 9:16 crop on the detected face. Falls back to centre crop when no face is present.
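A minimal sketch of the crop arithmetic, assuming a full-height crop on landscape footage and a face position given as a horizontal pixel coordinate. The function name crop_window and the even-width rounding are illustrative choices, not the tool's confirmed behaviour.

```python
from typing import Optional, Tuple


def crop_window(frame_w: int, frame_h: int,
                face_cx: Optional[float]) -> Tuple[int, int, int, int]:
    """Return (x, y, w, h) of a full-height 9:16 crop.

    face_cx is the horizontal face centre in pixels, or None when no face
    was detected, in which case the crop is centred on the frame.
    """
    crop_h = frame_h
    # If the frame is already narrower than 9:16, keep the full width.
    crop_w = min(frame_w, frame_h * 9 // 16)
    crop_w -= crop_w % 2  # even widths are safer for common pixel formats
    centre = frame_w / 2 if face_cx is None else face_cx
    x = int(round(centre - crop_w / 2))
    x = max(0, min(x, frame_w - crop_w))  # keep the window inside the frame
    return x, 0, crop_w, crop_h


print(crop_window(1920, 1080, None))    # no face: centred crop
print(crop_window(1920, 1080, 1900.0))  # face near the right edge: clamped
```

Clamping means a face near the edge of the frame ends up off-centre in the crop rather than pulling the window outside the image.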

Punch Zoom

Adds slow-zoom punch-in effects at strategic moments to increase retention without looking artificial.
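One way a slow zoom could be modelled is a linear ramp of the scale factor across a punch zone. This is a sketch under that assumption; the function name zoom_at, the linear easing, and the 1.15 maximum are hypothetical and not taken from the tool.

```python
def zoom_at(t: float, start: float, end: float, max_zoom: float = 1.15) -> float:
    """Scale factor at time t for a punch zone spanning [start, end] seconds.

    Ramps linearly from 1.0 at start to max_zoom at end, and returns 1.0
    (no zoom) outside the zone or for an empty zone.
    """
    if end <= start or not (start <= t <= end):
        return 1.0
    progress = (t - start) / (end - start)
    return 1.0 + (max_zoom - 1.0) * progress


# Sample the ramp once per second across a 4-second zone.
for second in range(0, 6):
    print(second, round(zoom_at(float(second), 1.0, 5.0), 3))
```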

ASS Captions with Highlight

Word-level timestamps from Whisper drive 3–5 word caption groups with an active-word highlight that tracks speech.
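The grouping and highlight idea can be sketched with plain ASS Dialogue events: one event per word, showing the whole group with the active word wrapped in a colour override. The helper names, the fixed group size of 4, the yellow highlight colour, and the "Default" style name are all assumptions. Simple chunking can also leave a final group shorter than three words, which a real implementation would likely rebalance.

```python
from typing import Dict, List


def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"


def group_words(words: List[Dict], size: int = 4) -> List[List[Dict]]:
    """Split the word list into consecutive chunks of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def dialogue_lines(group: List[Dict],
                   highlight: str = r"{\c&H00FFFF&}",  # ASS colours are BGR: yellow
                   reset: str = r"{\r}") -> List[str]:
    """Emit one Dialogue event per word, with that word highlighted."""
    lines = []
    for active, word in enumerate(group):
        parts = [
            f"{highlight}{w['text']}{reset}" if i == active else w["text"]
            for i, w in enumerate(group)
        ]
        text = " ".join(parts)
        lines.append(
            f"Dialogue: 0,{ass_time(word['start'])},{ass_time(word['end'])},"
            f"Default,,0,0,0,,{text}"
        )
    return lines


demo = [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.4, "end": 0.9},
]
for line in dialogue_lines(demo):
    print(line)
```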

Silence Removal

WebRTC VAD detects and cuts silent sections. The --silence-gap flag sets the minimum gap length that gets cut.
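A sketch of how per-frame speech decisions might be merged into keepable segments with a minimum-gap threshold. It assumes the flags come from a frame-level detector (for example, webrtcvad's Vad.is_speech called on 10, 20, or 30 ms frames); the function name and the millisecond units are hypothetical.

```python
from typing import List, Optional, Tuple


def speech_segments(flags: List[bool], frame_ms: int,
                    min_gap_ms: int) -> List[Tuple[int, int]]:
    """Turn per-frame speech flags into (start_ms, end_ms) segments.

    Silent runs shorter than min_gap_ms stay inside the surrounding
    segment; runs of min_gap_ms or longer are cut out.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start of the segment being built
    last_end = 0                 # end of the most recent speech frame
    for i, is_speech in enumerate(flags):
        if not is_speech:
            continue
        t = i * frame_ms
        if start is None:
            start = t
        elif t - last_end >= min_gap_ms:
            # The silence since the last speech frame is long enough to cut.
            segments.append((start, last_end))
            start = t
        last_end = t + frame_ms
    if start is not None:
        segments.append((start, last_end))
    return segments


flags = [True, True, False, True, False, False, False, False, True, True]
print(speech_segments(flags, frame_ms=30, min_gap_ms=90))
```

In this example the single silent frame is bridged, while the four-frame silence (120 ms) exceeds the 90 ms threshold and splits the audio into two segments.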

Loudnorm Audio

EBU R128 loudness normalisation applied per-segment and on the final output. Consistent audio across all clips.
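For illustration, a single-pass FFmpeg invocation using the loudnorm audio filter could be assembled like this. The target values (−23 LUFS integrated, −1 dBTP true peak, loudness range 11) are assumptions chosen to sit near EBU R128 recommendations, and the helper name and file names are hypothetical; the tool's actual targets and pass structure are not stated here.

```python
from typing import List


def loudnorm_cmd(src: str, dst: str, i: float = -23.0,
                 tp: float = -1.0, lra: float = 11.0) -> List[str]:
    """Build an ffmpeg argument list that applies the loudnorm filter.

    i   = integrated loudness target in LUFS (assumed value)
    tp  = maximum true peak in dBTP (assumed value)
    lra = loudness range target in LU (assumed value)
    """
    audio_filter = f"loudnorm=I={i}:TP={tp}:LRA={lra}"
    # Video is stream-copied; only the audio track is filtered and re-encoded.
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, "-c:v", "copy", dst]


cmd = loudnorm_cmd("segment_01.mp4", "segment_01_norm.mp4")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True), with ffmpeg on PATH.
```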

Test-driven, CI/CD

Unit tests cover every module. Automated CI runs on every push. Zero suppressed warnings.

Tech Stack

Built on Proven Tools.

Python

CLI · core pipeline

Whisper

Word-level transcription

WebRTC VAD

Speech detection

FFmpeg

Video render + loudnorm

OpenCV

Face detection + tracking

Automated testing

Tests · linting · CI

Build Status

V2 Complete.

All V2 modules are shipped. V3 (smart tracking, AI-guided cuts) is on the roadmap.

Core pipeline (audio → render): ✓ V2 Complete

Whisper word-level timestamps: ✓ V2 Complete

Face tracking crop: ✓ V2 Complete

ASS captions + active word highlight: ✓ V2 Complete

55 unit tests + CI/CD: ✓ V2 Complete

E2E test on real clips: → In Progress

V3 — Smart tracking, AI-guided cuts: Roadmap

Get in Touch

Contact

Questions, feedback, or integration requests — reach out directly.

contact@lpagesapplabs.com