What Composite Captions actually does
Composite Captions takes your uploaded vertical video and renders Sleepy Motion's full motion graphics treatment (animated text screens, kinetic typography, transitions) behind the subject of the video rather than on top of them.
The subject appears to stand in front of the animated graphics, creating genuine cinematic depth. It is the kind of shot that looks like a professional spent hours in After Effects rotoscoping the person frame by frame, then carefully animating text behind them.
Sleepy Motion does the entire thing automatically. Upload the video. Pick the spoken language. Hit Generate.
Why nobody else offers this
Every other captions and subtitle tool on the market does one thing: put text on top of your video. Flat overlay. Text over face. The subject is buried under the words.
Getting text to appear behind a subject requires two things that no automated tool has combined until now:
- Accurate subject segmentation: isolating the person (or people) from the background on every frame of the video.
- A full motion graphics engine: something capable of producing animated, designed text screens worth putting behind someone in the first place.
Caption tools have neither. Motion graphics tools have the second but require manual masking. After Effects has both but demands hours of skilled manual work per video.
Sleepy Motion combines subject masking with its full motion graphics pipeline into a single automated process. This is genuinely new. We searched and found no other tool that does this.
The gap this fills
Creators and marketers have been doing this manually in After Effects and Premiere for years. It's one of the most popular "pro look" techniques in social video. It just takes hours. Composite Captions delivers the same result in minutes, with no software, no skills, and no timeline.
How the subject masking works
When you submit a Composite Captions job, the engine runs subject segmentation across your video, detecting and isolating the person or people in each frame. This produces a per-frame mask that defines exactly where the subject is at every moment.
The motion graphics layer is then rendered into the video using this mask as a depth separator: graphics sit behind the mask layer, subject sits in front. The result is a clean, natural-looking composite where the person physically appears to be standing in front of the animated text.
The masking algorithm handles:
- Single subjects and multiple people in frame
- Movement: the mask follows the subject across the frame
- Partial occlusion: arms, hair, and edges are tracked accurately
- Fast cuts and scene changes
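To make the layer order concrete, here is a minimal Python sketch of compositing a single frame with a subject mask. It is an illustration under stated assumptions, not Sleepy Motion's actual renderer: the names `blend_pixel` and `composite_behind_subject`, the nested-list frame format, and the 0.0 to 1.0 mask values are all hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, ...]        # (R, G, B), each 0-255
Frame = List[List[Pixel]]      # rows of pixels
Matte = List[List[float]]      # per-pixel weight, 0.0 to 1.0


def blend_pixel(under: Pixel, over: Pixel, alpha: float) -> Pixel:
    """Linear mix: alpha=1.0 returns `over`, alpha=0.0 returns `under`."""
    return tuple(round(o * alpha + u * (1.0 - alpha)) for u, o in zip(under, over))


def composite_behind_subject(
    video: Frame,
    graphics: Frame,
    graphics_alpha: Matte,
    subject_mask: Matte,
) -> Frame:
    """Place graphics over the frame, then restore the subject on top.

    subject_mask is assumed to be 1.0 on the subject and 0.0 on the background.
    """
    out: Frame = []
    for y, row in enumerate(video):
        out_row: List[Pixel] = []
        for x, video_px in enumerate(row):
            # Bottom two layers: the original frame with the graphics drawn over it.
            with_graphics = blend_pixel(video_px, graphics[y][x], graphics_alpha[y][x])
            # Top layer: the original subject pixels, wherever the mask is set.
            out_row.append(blend_pixel(with_graphics, video_px, subject_mask[y][x]))
        out.append(out_row)
    return out
```

In this sketch a background pixel (mask 0.0) shows the graphics, while a subject pixel (mask 1.0) keeps its original color, so the person reads as standing in front of the text. Mask values between 0 and 1 would give a soft edge rather than a hard cutout.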
Dynamic placement: how the engine decides where text goes
This is what separates Composite Captions from anything else on the market. The engine doesn't just stick text in one fixed spot on every screen.
For each text screen, Sleepy Motion's placement algorithm analyzes the current frame (subject position, how much of the frame they occupy, where they are moving) and decides the optimal position for that specific moment. Text moves, breathes, and repositions throughout the video in a way that feels intentional and designed, not templated.
What dynamic placement does
- Finds open space in the frame for each screen
- Avoids covering the subject's face
- Adapts to movement and repositioning
- Creates natural visual rhythm across the video
The result
- Every screen feels intentionally composed
- Text never fights the subject for attention
- Looks like a director made layout decisions
- No two screens look identical
If you prefer full manual control over placement, switch the Position setting from Dynamic to Fixed. You can then use the transform positioner to set an exact location that applies consistently across all screens.
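One simple way to picture "finding open space" is to score a handful of candidate text boxes by how much of the subject each one overlaps, then pick the lowest score. The sketch below does only that. It is an assumption made for illustration, not the placement algorithm Sleepy Motion ships: the names `subject_overlap` and `pick_text_box`, the candidate boxes, and the scoring rule are hypothetical, and movement and face position are not modeled.

```python
from typing import List, Tuple

Mask = List[List[int]]             # 1 = subject pixel, 0 = background pixel
Box = Tuple[int, int, int, int]    # (left, top, width, height) in pixels


def subject_overlap(mask: Mask, box: Box) -> int:
    """Count the subject pixels that fall inside the box."""
    left, top, width, height = box
    return sum(
        mask[y][x]
        for y in range(top, top + height)
        for x in range(left, left + width)
    )


def pick_text_box(mask: Mask, candidates: List[Box]) -> Box:
    """Return the candidate covering the fewest subject pixels.

    Ties go to the earliest candidate in the list.
    """
    return min(candidates, key=lambda box: subject_overlap(mask, box))


# A tiny 4x4 "frame" with the subject in the lower middle.
mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
top_band = (0, 0, 4, 2)
bottom_band = (0, 2, 4, 2)
chosen = pick_text_box(mask, [bottom_band, top_band])  # top_band: it overlaps no subject pixels
```

Running the same choice once per text screen, against that screen's mask, is what would make placement change as the subject moves.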
How to use it: step by step
- Step 1: Switch to Captions mode
  At the top of the form, click the Captions tab. The sub-switcher defaults to Composite. Leave it there.
- Step 2: Upload your vertical video
  MP4 only, vertical format (9:16), up to 150 MB, up to 30 seconds. The upload area shows a thumbnail preview once the file is loaded. Hit the X on the thumbnail to swap it out.
- Step 3: Set the Caption language
  Pick the language spoken in the video. This is not a translation setting: it tells the transcription engine what language to expect, which dramatically improves accuracy. If you leave it on Auto, the engine will detect it, but setting it manually is always more reliable.
- Step 4: Adjust Effects and Blend if needed
  Leave Effects on None and Blend on Normal for your first generation. Once you see the result, you can experiment: Fluid adds a deformer effect to the graphics, and Blend modes change how the graphics layer interacts with the video.
- Step 5: Hit Generate
  The engine transcribes your audio, segments the subject, places and renders the motion graphics behind them, and delivers the finished video. This takes a few minutes.
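If you prepare clips in bulk, it can help to check them against the upload limits from step 2 before opening the form. The following is a minimal sketch of such a pre-check, not part of Sleepy Motion itself. `VideoInfo` and `upload_problems` are hypothetical names, the metadata is assumed to come from whatever tool you already use to inspect video files, "150 MB" is assumed to mean 150,000,000 bytes, and the 9:16 test is assumed to be exact.

```python
from dataclasses import dataclass
from typing import List

MAX_SIZE_BYTES = 150 * 1000 * 1000   # assumption: "150 MB" read as decimal megabytes
MAX_DURATION_SECONDS = 30.0


@dataclass
class VideoInfo:
    """Metadata for one clip (hypothetical structure for this sketch)."""
    filename: str
    size_bytes: int
    duration_seconds: float
    width: int
    height: int


def upload_problems(info: VideoInfo) -> List[str]:
    """Return a list of reasons the clip would miss the stated limits."""
    problems: List[str] = []
    if not info.filename.lower().endswith(".mp4"):
        problems.append("file is not an MP4")
    if info.size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than 150 MB")
    if info.duration_seconds > MAX_DURATION_SECONDS:
        problems.append("video is longer than 30 seconds")
    # 9:16 means width / height == 9 / 16; cross-multiplying avoids float rounding.
    if info.width * 16 != info.height * 9:
        problems.append("frame is not 9:16 vertical")
    return problems
```

An empty list means the clip fits every limit listed in step 2. The extension check only looks at the file name, so it will not catch a file that is mislabeled.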
Position: Dynamic vs Fixed
The Position setting is only available in Composite mode and controls how the engine places each text screen.
Dynamic (default)
The engine analyzes each frame and chooses the best position for that specific moment. Text placement varies throughout the video: it follows the composition, avoids the subject's face, and creates a natural, editorial rhythm.
Best for: most videos
Fixed
Every text screen appears at a consistent position you define using the transform positioner. Predictable, stable, good for videos where you know exactly where you want text to live.
Best for: controlled compositions
Effects and Blend modes
These two controls let you push the visual treatment further once you have a solid base result.
Effects
- None: standard Sleepy Motion motion graphics, clean and sharp.
- Fluid: adds a fluid deformer to the graphics layer, creating organic, liquid-like distortion in the text and shapes. High energy, very distinctive look.
Blend modes
Blend modes control how the motion graphics layer interacts with the video frames behind it. They behave identically to blend modes in After Effects or Photoshop.
- Normal: graphics render as solid elements, no interaction with the video.
- Screen: dark areas of the graphics become transparent, light areas glow through. Good for light, airy treatments.
- Overlay: increases contrast and saturation where the graphics meet the video. Punchy, high-energy feel.
- Soft Light: a gentler version of Overlay. Adds depth without harshness.
- Difference: subtracts the graphics and video colors from each other, which inverts the video wherever the graphics are bright. Surreal, experimental. Use deliberately.
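For readers who want to see what these modes do numerically, here is a small per-channel sketch in Python. The formulas are the standard ones from the W3C Compositing and Blending specification, which is an assumption on our part: Sleepy Motion's exact math is not documented here, and Soft Light in particular has several variants across tools. The function name `blend_channel` and the mode strings are hypothetical.

```python
import math


def blend_channel(mode: str, base: float, top: float) -> float:
    """Blend one color channel.

    base is the video value and top is the graphics value, both in 0.0 to 1.0.
    """
    if mode == "normal":
        return top
    if mode == "screen":
        return 1.0 - (1.0 - base) * (1.0 - top)
    if mode == "overlay":
        # Darkens dark video areas and brightens light ones, which raises contrast.
        if base <= 0.5:
            return 2.0 * base * top
        return 1.0 - 2.0 * (1.0 - base) * (1.0 - top)
    if mode == "soft_light":
        if top <= 0.5:
            return base - (1.0 - 2.0 * top) * base * (1.0 - base)
        if base <= 0.25:
            d = ((16.0 * base - 12.0) * base + 4.0) * base
        else:
            d = math.sqrt(base)
        return base + (2.0 * top - 1.0) * (d - base)
    if mode == "difference":
        return abs(base - top)
    raise ValueError(f"unknown blend mode: {mode}")
```

Two quick checks match the descriptions above: with Screen, a black graphics value (0.0) leaves the video unchanged, and with Difference, a white graphics value (1.0) turns a video value of 0.25 into 0.75, which is its inverse.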
Recommended starting point
Start with Normal and generate once. Then try Screen if your video has a lot of dark areas, or Overlay if you want maximum visual punch. Blend modes change the feel dramatically, so always compare against your Normal baseline before committing.
Best practices for great results
- Use clean, well-lit footage
  Good lighting with clear contrast between subject and background gives the masking algorithm the most to work with. Low-light or heavily compressed video will produce softer edges.
- Set the language manually
  Auto-detection works, but specifying the spoken language always produces more accurate transcription. A wrong word on a caption screen undermines an otherwise perfect result.
- Keep videos under 30 seconds for now
  The current limit is 30 seconds. Longer-form support is coming. For content that runs long, trim to the most impactful 20 to 30 seconds before uploading.
- Start with Dynamic placement
  The algorithm is good at finding open space in the frame. Let it run first. You will almost always prefer the dynamic result over a fixed position. Switch to Fixed only if you have a specific reason.
- Retry to refine
  Like all Sleepy Motion modes, the Edit & Retry workflow applies. If the first generation is 80% there, use the transform positioner and language setting to dial in the remaining 20% without starting over.
The honest bar
Composite Captions produces results that match what a skilled After Effects editor would deliver in hours, automatically and in minutes. The masking is not pixel-perfect on every frame of every video. It is production-quality for the vast majority of well-shot vertical content. Treat the first generation as a strong draft, use Retry to polish, and you will consistently get results worth publishing.