Deep Dive · 8 min read

Composite Captions: Motion Graphics Behind Your Subject

Composite Captions is a first-of-its-kind pipeline that renders full Sleepy Motion motion graphics behind the subject of your video using automatic subject masking and AI-driven dynamic placement. No manual rotoscoping, no keyframes, no After Effects. Upload your video, pick your language, and let the engine do what used to take hours.

🎭 What Composite Captions actually does

Composite Captions takes your uploaded vertical video and renders Sleepy Motion's full motion graphics treatment , animated text screens, kinetic typography, transitions , behind the subject of the video rather than on top of them.

The subject appears to stand in front of the animated graphics, creating genuine cinematic depth. The kind of shot that looks like a professional spent hours in After Effects rotoscoping the person frame by frame, then carefully animated text behind them.

Sleepy Motion does the entire thing automatically. Upload the video. Pick the spoken language. Hit Generate.

What you get: your original video with full Sleepy Motion motion graphics depth-composited behind the subject, synchronized to the spoken words, dynamically placed per screen, and production-ready in minutes.

🌍 Why nobody else offers this

Every other captions and subtitle tool on the market does one thing: put text on top of your video. Flat overlay. Text over face. The subject is buried under the words.

Getting text to appear behind a subject requires two things that no automated tool has combined until now:

  1. Accurate subject segmentation: isolating the person (or people) from the background on every frame of the video.
  2. A full motion graphics engine: something capable of producing animated, designed text screens worth putting behind someone in the first place.

Caption tools have neither. Motion graphics tools have the second but require manual masking. After Effects has both but demands hours of skilled manual work per video.

Sleepy Motion combines subject masking with its full motion graphics pipeline into a single automated process. This is genuinely new. We searched and found no other tool that does this.

The gap this fills

Creators and marketers have been doing this manually in After Effects and Premiere for years; it's one of the most popular “pro look” techniques in social video. It just takes hours. Composite Captions delivers the same result in minutes, with no software, no skills, and no timeline.

🎯 How the subject masking works

When you submit a Composite Captions job, the engine runs subject segmentation across your video, detecting and isolating the person or people in each frame. This produces a per-frame mask that defines exactly where the subject is at every moment.

The motion graphics layer is then rendered into the video using this mask as a depth separator: graphics sit behind the mask layer, subject sits in front. The result is a clean, natural-looking composite where the person physically appears to be standing in front of the animated text.
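
To make the layering concrete, here is a minimal Python sketch of mask-based compositing for one frame. It illustrates the general technique only and is not Sleepy Motion's actual renderer: the function names, the nested-list frame format, and the 0-to-1 value range are all assumptions made for the example.

```python
def composite_pixel(video, graphics, graphics_alpha, mask):
    """Composite one RGB pixel; every value is a float in [0, 1].

    video          -- original video pixel (r, g, b)
    graphics       -- motion graphics pixel (r, g, b)
    graphics_alpha -- opacity of the graphics at this pixel
    mask           -- subject mask: 1.0 = subject, 0.0 = background
    """
    out = []
    for v, g in zip(video, graphics):
        # Step 1: lay the graphics over the original frame (Normal blend).
        behind = g * graphics_alpha + v * (1.0 - graphics_alpha)
        # Step 2: put the subject back on top wherever the mask says so.
        out.append(v * mask + behind * (1.0 - mask))
    return tuple(out)


def composite_frame(video, graphics, alpha, mask):
    """Apply composite_pixel to every pixel of a frame stored as nested lists."""
    return [
        [
            composite_pixel(video[y][x], graphics[y][x], alpha[y][x], mask[y][x])
            for x in range(len(video[y]))
        ]
        for y in range(len(video))
    ]
```

Where the mask is 1.0 the original pixel survives untouched, where it is 0.0 the graphics show, and fractional mask values along hair and edges mix the two, which is why softer masks produce softer edges.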

The masking algorithm handles:

  • Single subjects and multiple people in frame
  • Movement: the mask follows the subject across the frame
  • Partial occlusion: arms, hair, and edges are tracked accurately
  • Fast cuts and scene changes

Best source material: well-lit video with clear contrast between the subject and the background. Solid or simple backgrounds give the cleanest masks. Busy or cluttered backgrounds still work but produce softer edges.

🧠 Dynamic placement , how the engine decides where text goes

This is what separates Composite Captions from anything else on the market. The engine doesn't just stick text in one fixed spot on every screen.

For each text screen, Sleepy Motion's placement algorithm analyzes the current frame , subject position, how much of the frame they occupy, where they are moving , and decides the optimal position for that specific moment. Text moves, breathes, and repositions throughout the video in a way that feels intentional and designed, not templated.
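
As a rough illustration of the idea, the sketch below scores candidate text boxes by how much of the subject mask they cover and picks the most open one, skipping any box that touches the face. The real placement algorithm is not documented here, so treat every name and the scoring rule itself as hypothetical.

```python
def overlaps(a, b):
    """True if two (top, left, height, width) boxes share any cell."""
    a_top, a_left, a_h, a_w = a
    b_top, b_left, b_h, b_w = b
    return (
        a_top < b_top + b_h and b_top < a_top + a_h
        and a_left < b_left + b_w and b_left < a_left + a_w
    )


def coverage(mask, top, left, height, width):
    """Fraction of the box's cells that the mask marks as subject."""
    covered = sum(
        mask[y][x]
        for y in range(top, top + height)
        for x in range(left, left + width)
    )
    return covered / (height * width)


def pick_text_box(mask, box_height, box_width, face=None):
    """Return (top, left) of the candidate box with the least subject coverage.

    mask -- 2D list, 1 where the subject is, 0 for background
    face -- optional (top, left, height, width) box that text must not overlap
    Returns None when every candidate overlaps the face.
    """
    rows, cols = len(mask), len(mask[0])
    best, best_score = None, None
    for top in range(rows - box_height + 1):
        for left in range(cols - box_width + 1):
            if face and overlaps((top, left, box_height, box_width), face):
                continue
            score = coverage(mask, top, left, box_height, box_width)
            # Strict comparison keeps the first of several equally open boxes.
            if best_score is None or score < best_score:
                best, best_score = (top, left), score
    return best
```

Running this once per text screen on that screen's mask is what would make the position change as the subject moves. A production system would presumably also weigh motion and visual rhythm, which this sketch ignores.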

What dynamic placement does

  • ✓ Finds open space in the frame per screen
  • ✓ Avoids covering the subject's face
  • ✓ Adapts to movement and repositioning
  • ✓ Creates natural visual rhythm across the video

The result

  • ✓ Every screen feels intentionally composed
  • ✓ Text never fights the subject for attention
  • ✓ Looks like a director made layout decisions
  • ✓ No two screens look identical

If you prefer full manual control over placement, switch the Position setting from Dynamic to Fixed. You can then use the transform positioner to set an exact location that applies consistently across all screens.

📋 How to use it: step by step

  1. Switch to Captions mode

    At the top of the form, click the Captions tab. The sub-switcher defaults to Composite; leave it there.

  2. Upload your vertical video

    MP4 only, vertical format (9:16), up to 150MB, up to 30 seconds. The upload area shows a thumbnail preview once the file is loaded. Hit the X on the thumbnail to swap it out.

  3. Set the Caption language

    Pick the language spoken in the video. This is not a translation setting: it tells the transcription engine what language to expect, which dramatically improves accuracy. If you leave it on Auto, the engine will detect the language, but setting it manually is more reliable.

  4. Adjust Effects and Blend if needed

    Leave Effects on None and Blend on Normal for your first generation. Once you see the result, you can experiment: Fluid adds a deformer effect to the graphics, and Blend modes change how the graphics layer interacts with the video.

  5. Hit Generate

    The engine transcribes your audio, segments the subject, places and renders the motion graphics behind them, and delivers the finished video. This takes a few minutes.
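
If you prepare clips in bulk, it can help to check the upload limits from step 2 before you upload. The helper below is a hypothetical local pre-flight check, not part of Sleepy Motion: the function name and messages are made up, you supply the width, height, size, and duration yourself, and it assumes "150MB" means 150 × 1024 × 1024 bytes and that 9:16 must match exactly.

```python
import os

MAX_BYTES = 150 * 1024 * 1024  # assumption: "150MB" read as 150 MiB
MAX_SECONDS = 30.0


def check_upload(filename, width, height, size_bytes, duration_seconds):
    """Return a list of problems; an empty list means the clip fits the limits."""
    problems = []
    if os.path.splitext(filename)[1].lower() != ".mp4":
        problems.append("file must be an MP4")
    # 9:16 means width * 16 == height * 9; integer maths avoids rounding issues.
    if width * 16 != height * 9:
        problems.append("video must be vertical 9:16")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 150MB")
    if duration_seconds > MAX_SECONDS:
        problems.append("video is longer than 30 seconds")
    return problems
```

For example, a 1080×1920 clip of 28.5 seconds and 50 MB passes every check, while a landscape 1920×1080 `.mov` fails the format and aspect ratio checks regardless of its length.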

📍 Position: Dynamic vs Fixed

The Position setting is only available in Composite mode and controls how the engine places each text screen.

Dynamic (default)

The engine analyzes each frame and chooses the best position for that specific moment. Text placement varies throughout the video: it follows the composition, avoids the subject's face, and creates a natural, editorial rhythm.

Best for: most videos

Fixed

Every text screen appears at a consistent position you define using the transform positioner. Predictable, stable, good for videos where you know exactly where you want text to live.

Best for: controlled compositions

When Position is set to Fixed, the transform positioner appears in Classic mode. Drag the text placeholder in the video preview to set your position, and use the size and rotation controls to fine-tune.

✨ Effects and Blend modes

These two controls let you push the visual treatment further once you have a solid base result.

Effects

  • None: standard Sleepy Motion motion graphics, clean and sharp.
  • Fluid: adds a fluid deformer to the graphics layer, creating organic, liquid-like distortion in the text and shapes. High energy, very distinctive look.

Blend modes

Blend modes control how the motion graphics layer interacts with the video frames behind it. They behave identically to blend modes in After Effects or Photoshop.

  • Normal: graphics render as solid elements, no interaction with the video.
  • Screen: dark areas of the graphics become transparent, light areas glow through. Good for light, airy treatments.
  • Overlay: increases contrast and saturation where the graphics meet the video. Punchy, high-energy feel.
  • Soft Light: a gentler version of Overlay. Adds depth without harshness.
  • Difference: takes the absolute difference between the two layers, so the video's colors invert wherever the graphics are bright. Surreal, experimental. Use deliberately.
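
For readers who want to see what these modes do numerically, here is a small sketch of the commonly published per-channel formulas. It is an illustration, not Sleepy Motion's code: the function name and the mode strings are made up for the example, values are assumed to be floats between 0 and 1, and Soft Light in particular has several variants in different tools (the one shown is the Pegtop formula, picked for simplicity).

```python
def blend(mode, base, top):
    """Blend one color channel.

    base -- video value in [0, 1]
    top  -- graphics value in [0, 1]
    """
    if mode == "normal":
        # The graphics simply replace the video.
        return top
    if mode == "screen":
        # Black graphics (0.0) leave the video unchanged; white wins.
        return 1.0 - (1.0 - base) * (1.0 - top)
    if mode == "overlay":
        # Multiply in the shadows, screen in the highlights: more contrast.
        if base <= 0.5:
            return 2.0 * base * top
        return 1.0 - 2.0 * (1.0 - base) * (1.0 - top)
    if mode == "soft_light":
        # Pegtop soft light: a smoother, lower-contrast cousin of overlay.
        return (1.0 - 2.0 * top) * base * base + 2.0 * top * base
    if mode == "difference":
        # White graphics (1.0) invert the video; black leaves it alone.
        return abs(base - top)
    raise ValueError(f"unknown blend mode: {mode}")
```

In Screen and Difference, black graphics leave the video untouched, and in Overlay and Soft Light, mid-gray graphics do the same. That is why the same design can look very different from one mode to the next.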

Recommended starting point

Start with Normal and generate once. Then try Screen if your video has a lot of dark areas, or Overlay if you want maximum visual punch. Blend modes change the feel dramatically, so always compare against your Normal baseline before committing.

✅ Best practices for great results

  • 🎥 Use clean, well-lit footage

    Good lighting with clear contrast between subject and background gives the masking algorithm the most to work with. Low-light or heavily compressed video will produce softer edges.

  • 🗣️ Set the language manually

    Auto-detection works, but specifying the spoken language produces more accurate transcription. A wrong word on a caption screen undermines an otherwise perfect result.

  • ⏱️ Keep videos under 30 seconds for now

    The current limit is 30 seconds. Longer-form support is coming. For content that runs long, trim to the most impactful 20–30 seconds before uploading.

  • 🎭 Start with Dynamic placement

    The algorithm is good at finding open space in the frame. Let it run first; you will almost always prefer the dynamic result over a fixed position. Switch to Fixed only if you have a specific reason.

  • 🔁 Retry to refine

    Like all Sleepy Motion modes, the Edit & Retry workflow applies. If the first generation is 80% there, use the transform positioner and language setting to dial in the remaining 20% without starting over.
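
If you need to trim a longer video down to the 30-second limit first, one option outside Sleepy Motion is the free `ffmpeg` command-line tool. The sketch below only builds the argument list for a trim; the helper name is hypothetical and using ffmpeg at all is an assumption, since the product does not require it. The flags are standard ffmpeg options: `-ss` (start time), `-t` (duration), and `-c copy` (copy streams without re-encoding).

```python
def trim_command(src, dst, start_seconds, length_seconds=30.0):
    """Build an ffmpeg argument list that cuts a clip of at most 30 seconds."""
    if start_seconds < 0:
        raise ValueError("start_seconds must not be negative")
    if not 0 < length_seconds <= 30:
        raise ValueError("length_seconds must be between 0 and 30")
    return [
        "ffmpeg",
        "-ss", f"{start_seconds:.3f}",  # where the clip starts in the source
        "-i", src,
        "-t", f"{length_seconds:.3f}",  # how long the clip runs
        "-c", "copy",                   # copy audio and video without re-encoding
        dst,
    ]
```

You could pass the result to `subprocess.run(trim_command("long.mp4", "clip.mp4", 12.5, 25), check=True)`. Because `-c copy` cuts on keyframes, the clip may not start or end exactly where requested, so it is worth aiming a little under 30 seconds or re-encoding when you need a frame-accurate cut.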

The honest bar

Composite Captions produces results that match what a skilled After Effects editor would deliver in hours, but automatically and in minutes. The masking is not pixel-perfect on every frame of every video. It is production-quality for the vast majority of well-shot vertical content. Treat the first generation as a strong draft, use Retry to polish, and you will consistently get results worth publishing.

Ready to put this into practice?

Open the generator, set your brand colors first, and hit Generate.
