How to Generate Subtitles Automatically
Whisper AI generates word-level subtitles from your audio — edit, style, and export in one workflow.
Step-by-Step Instructions
- 1
Drop your audio or video file
Upload any audio or video file. The tool extracts the audio track automatically from video files. Supported formats include MP4, MOV, WebM, MP3, WAV, and more.
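To illustrate what extracting the audio track involves, here is a minimal offline sketch that builds an ffmpeg command producing 16 kHz mono WAV, the sample rate Whisper models expect. It is not how VideoCaptions.AI does it (the tool works in the browser). The function name `extract_audio_command` and the extension set are assumptions, and the set covers only the formats named above.

```python
from pathlib import Path

# Only the formats named in the step above; the tool's full list is not given.
SUPPORTED = {".mp4", ".mov", ".webm", ".mp3", ".wav"}


def extract_audio_command(source: str, target: str = "audio.wav") -> list[str]:
    """Build an ffmpeg command that drops video and writes 16 kHz mono PCM WAV."""
    suffix = Path(source).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported format: {suffix or '(none)'}")
    return [
        "ffmpeg", "-y",
        "-i", source,
        "-vn",                 # ignore any video stream
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # uncompressed 16-bit PCM
        target,
    ]
```

Pass the returned list to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed.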
- 2
Select your Whisper model
Choose among the tiny, base, and small models. The base model is a good default: fast and accurate for clear English speech. For other languages or accented speech, the small model improves accuracy at the cost of roughly double the processing time.
Tip: base.en is the English-only variant of the base model and runs faster than small. Use the multilingual small model for non-English content.
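The guidance above can be written as a small decision rule. This is a sketch only: `pick_model` and its parameters are hypothetical, while `base.en` and `small` are standard Whisper model names.

```python
def pick_model(language: str, challenging_audio: bool = False) -> str:
    """Return a Whisper model name following the guidance in this step.

    challenging_audio stands for accented speech or other hard conditions;
    it is an illustrative flag, not a setting taken from the tool.
    """
    is_english = language.strip().lower() in {"en", "english"}
    if is_english and not challenging_audio:
        return "base.en"   # English-only base model: the suggested default
    return "small"         # multilingual small model for everything else


print(pick_model("en"))        # base.en
print(pick_model("es"))        # small
print(pick_model("en", True))  # small
```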
- 3
Review and edit the transcript
Whisper generates word-level subtitles with timestamps. Review every word in the visual editor. Fix any misheard words, remove filler content, and adjust timing for words that need correction.
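The three kinds of edits described here (fixing a word, removing filler, adjusting timing) map onto simple operations on a list of timed words. The sketch below assumes a hypothetical `Word` record and an illustrative filler list; the editor's real data model is not described in this guide.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    text: str
    start: float     # seconds from the start of the audio
    duration: float  # seconds


FILLERS = {"um", "uh", "er"}  # illustrative, not the tool's list


def fix_word(words, index, new_text):
    """Return a copy of the list with one misheard word corrected."""
    return [replace(w, text=new_text) if i == index else w
            for i, w in enumerate(words)]


def drop_fillers(words):
    """Remove filler words, ignoring case and trailing punctuation."""
    return [w for w in words if w.text.strip(".,!?").lower() not in FILLERS]


def shift_start(words, index, delta):
    """Move one word earlier or later by delta seconds, never below zero."""
    return [replace(w, start=max(0.0, w.start + delta)) if i == index else w
            for i, w in enumerate(words)]
```

Each function returns a new list and leaves the original untouched, which keeps undo simple.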
- 4
Style your subtitles
Choose fonts, colors, position, and animation effects. Unlike basic subtitle generators that produce plain SRT files, VideoCaptions.AI gives you full visual control over how your subtitles look on screen.
- 5
Export as burned-in MP4
Export your video with subtitles composited directly into the video frames. The result is a single MP4 file that displays your styled subtitles on any device or platform without requiring a separate subtitle file.
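For comparison, burning subtitles into frames can also be done offline with ffmpeg's `ass` filter, which renders a styled ASS subtitle file onto the video. The sketch below only builds the command. `burn_in_command` is a hypothetical helper, and it assumes an ffmpeg build with libass. It is not the tool's export path.

```python
def burn_in_command(video: str, ass_file: str, output: str = "captioned.mp4") -> list[str]:
    """Build an ffmpeg command that composites ASS subtitles into the frames."""
    # Filter arguments need escaping for these characters; this sketch
    # simply refuses such paths instead of escaping them.
    if any(ch in ass_file for ch in ":,'\\[]"):
        raise ValueError("subtitle path needs filter escaping; use a simple file name")
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-vf", f"ass={ass_file}",  # render the subtitles onto every frame
        "-c:v", "libx264",         # frames change, so video must be re-encoded
        "-c:a", "aac",
        output,
    ]
```

The video must be re-encoded because the filter changes every frame, which is why a burned-in export takes longer than attaching a separate subtitle file.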
01
Automatic Subtitles vs. Manual Transcription
Manual transcription is accurate but slow. Professional transcriptionists work at roughly four times real time: a one-minute video takes about four minutes to transcribe. For longer content, this becomes prohibitively expensive at typical transcription rates.
AI-powered automatic subtitles change the equation. Whisper transcribes a one-minute clip in approximately 15 to 30 seconds, depending on the model and your device's processing power, so the first draft arrives roughly 8 to 16 times faster than typing it by hand. The accuracy is high enough that you spend your time making minor corrections rather than typing from scratch. This workflow, in which AI generates a draft and a human reviews and polishes it, is far faster than manual transcription and produces results of comparable quality after review.
VideoCaptions.AI combines transcription, editing, styling, and export in a single browser-based tool. You never need to copy transcripts between applications, import SRT files, or deal with subtitle timing formats.
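Plugging the figures quoted above into a short calculation shows the size of the gap for the draft alone. Review time is not included, and the constant and function names are illustrative.

```python
MANUAL_MINUTES_PER_MINUTE = 4.0        # "four times real time"
WHISPER_SECONDS_PER_MINUTE = (15, 30)  # quoted range for a one-minute clip


def draft_times(video_minutes: float) -> tuple[float, float, float]:
    """Return (manual, whisper_best, whisper_worst) draft times in minutes."""
    manual = video_minutes * MANUAL_MINUTES_PER_MINUTE
    best = video_minutes * WHISPER_SECONDS_PER_MINUTE[0] / 60
    worst = video_minutes * WHISPER_SECONDS_PER_MINUTE[1] / 60
    return manual, best, worst


manual, best, worst = draft_times(10)
print(manual, best, worst)            # 40.0 2.5 5.0
print(manual / worst, manual / best)  # 8.0 16.0
```

For a ten-minute video, the draft takes 2.5 to 5 minutes instead of 40, a speedup of 8 to 16 times before review.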
02
How Whisper AI Generates Word-Level Subtitles
Whisper is OpenAI's open-source speech recognition model. Where many speech-to-text workflows time text only at the sentence or phrase level, Whisper is run here to produce word-level timestamps: every individual word gets a start time and a duration. This granularity is what enables advanced subtitle features like karaoke-style word highlighting and per-word animation effects.
VideoCaptions.AI runs Whisper entirely in your browser using WebAssembly, which means your audio is processed locally on your device. The multilingual models support 99 languages and handle accented speech, code-switching between languages, and speaking styles ranging from formal presentations to casual conversation.
After Whisper generates the raw transcript, the tool organizes words into timed scenes based on sentence boundaries and your chosen words-per-page setting. Each scene becomes a visual unit in the editor where you can adjust text, timing, effects, and positioning before export.
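The scene-grouping step can be sketched as follows. This is one plausible reading of "sentence boundaries plus a words-per-page limit", not the tool's actual algorithm. `Word`, `Scene`, and `group_into_scenes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float     # seconds
    duration: float  # seconds


@dataclass
class Scene:
    words: list

    @property
    def start(self) -> float:
        return self.words[0].start

    @property
    def end(self) -> float:
        last = self.words[-1]
        return last.start + last.duration

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


def group_into_scenes(words, words_per_page=4):
    """Close a scene at sentence-ending punctuation or when the page is full."""
    scenes, current = [], []
    for word in words:
        current.append(word)
        ends_sentence = word.text.rstrip().endswith((".", "?", "!"))
        if ends_sentence or len(current) >= words_per_page:
            scenes.append(Scene(current))
            current = []
    if current:  # leftover words that never reached a boundary
        scenes.append(Scene(current))
    return scenes
```

Because every scene keeps its own words and timings, a renderer could highlight the active word by comparing playback time against each word's `start` and `duration`.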
Frequently Asked Questions
Everything you need to know before you start.
How accurate are the automatic subtitles?
Whisper achieves high accuracy for clear speech in supported languages. English content with a single speaker in good audio conditions produces near-perfect transcripts. Accuracy decreases with background noise, heavy accents, overlapping speakers, or specialized technical jargon. Always review the transcript before exporting.
Which languages are supported?
Whisper supports 99 languages, including English, Spanish, French, German, Hindi, Arabic, Japanese, Korean, Mandarin, and Portuguese. English has the highest accuracy, and other major world languages perform well. Less common languages may have lower accuracy.
Can I export an SRT file?
VideoCaptions.AI currently exports subtitles as burned-in MP4 video. The subtitles are rendered directly into the video frames with your chosen styling and animation effects. SRT export is on the roadmap for a future update.
How long does transcription take?
Speed depends on your device and the Whisper model selected. The tiny model is fastest but least accurate. The base model transcribes one minute of audio in roughly 15 to 30 seconds on a modern laptop. The small model takes about twice as long but produces better accuracy for challenging audio.
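For readers curious about the SRT format mentioned in the FAQ, the sketch below shows how timed text maps onto it. Each cue has a counter, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, and the text. This is a generic illustration with hypothetical function names, not the tool's planned exporter.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues) -> str:
    """Render (start, end, text) tuples as SRT; cues are numbered from 1."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues


print(to_srt([(0.0, 1.2, "Hello world."), (1.2, 2.0, "Goodbye.")]))
```

SRT carries only text and timing, so fonts, colors, and animation effects would not survive such an export.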