
How to Transcribe YouTube Videos with AI


Most transcription tutorials start with “first, install yt-dlp.” Then you download the video, extract the audio track, convert it to the right format, and upload it to a speech-to-text API. Four steps before you get a single word of text.

deAPI skips all of that. You send a YouTube URL to the /audio/transcriptions endpoint, and Whisper Large V3 returns the transcript. One API call, one response, done. The same endpoint handles Twitch VODs, TikTok, Kick, X videos, and even X Spaces – swap the URL and everything else stays identical.

This guide walks through the full workflow with working code in Python and JavaScript, then covers three practical things you can build on top of transcription.

What You’re Working With

The model behind this is Whisper Large V3 – OpenAI’s 1.55-billion-parameter speech recognition model, trained on over 5 million hours of audio. It supports 99 languages, handles accents and background noise well, and produces word-level or sentence-level timestamps on demand.

Through deAPI, you access Whisper via a unified /audio/transcriptions endpoint that auto-detects what you’re sending:

Source          What you provide
------          ----------------
YouTube         Video URL
Twitch          VOD URL
TikTok          Video URL
Kick            Video URL
X (Twitter)     Post URL with video
X Spaces        Space URL
Local file      Audio or video upload

No download step, no audio extraction, no format conversion. The platform handles that server-side.

Pricing: $0.021 per hour of audio. The $5 free credit you get on signup covers roughly 238 hours of transcription – enough to process an entire podcast backlog before spending anything.

Transcribe a YouTube Video: Step by Step

Step 1: Send the URL

Python:

import requests

API_KEY = "your_api_key_here"
BASE = "https://api.deapi.ai/api/v2"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json"
}

# Submit the job; the endpoint returns a request_id, not the transcript itself
response = requests.post(f"{BASE}/audio/transcriptions",
    headers=HEADERS,
    data={
        "source_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
        "model": "WhisperLargeV3",
        "include_ts": "true"  # sent as a form field, so use the string "true"
    }
)

request_id = response.json()["data"]["request_id"]
print(f"Job submitted: {request_id}")

JavaScript:

const API_KEY = "your_api_key_here";
const BASE = "https://api.deapi.ai/api/v2";

const form = new FormData();
form.append("source_url", "<https://www.youtube.com/watch?v=dQw4w9WgXcQ>");
form.append("model", "WhisperLargeV3");
form.append("include_ts", "true");

const res = await fetch(`${BASE}/audio/transcriptions`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${API_KEY}`,
    "Accept": "application/json"
  },
  body: form
});

const { data } = await res.json();
console.log(`Job submitted: ${data.request_id}`);

Three fields are all you need: the URL, the model slug, and whether you want timestamps.

Step 2: Poll for the Result

deAPI processes transcription asynchronously. You submit the job, get a request_id, then check back until it’s done.

import time

# Check the job every few seconds until it finishes or fails
while True:
    status = requests.get(
        f"{BASE}/jobs/{request_id}",
        headers=HEADERS
    ).json()

    state = status["data"]["status"]
    if state == "done":
        result_url = status["data"]["result_url"]
        print("Transcription ready!")
        break
    elif state == "error":
        print(f"Failed: {status}")
        break

    time.sleep(5)

A 10-minute YouTube video typically processes in under 30 seconds. For production use, you can skip polling entirely by passing a webhook_url parameter – deAPI will POST the result to your server when the job finishes.
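Adding webhook_url to the original request is all that takes; the shape of the payload deAPI sends to your endpoint isn't covered in this guide, so the handler side is left out:

response = requests.post(f"{BASE}/audio/transcriptions",
    headers=HEADERS,
    data={
        "source_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
        "model": "WhisperLargeV3",
        "include_ts": "true",
        # deAPI POSTs the finished result to this endpoint instead of
        # waiting for you to poll
        "webhook_url": "https://example.com/hooks/transcription"
    }
)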

Step 3: Get the Transcript

Once the status is done, the response includes a result_url pointing to the transcript file. Download it to get the full text:

# result_url serves the transcript as plain text
transcript = requests.get(result_url).text
print(transcript)

Shortcut for short videos: add return_result_in_response: true to your initial request. When the job finishes, the status response includes a result field with the full transcript text alongside the result_url. You can grab the text directly without a separate download call.
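Assuming the flag is passed as a form field like the others, the request and the pickup look like this:

data={
    "source_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "model": "WhisperLargeV3",
    "return_result_in_response": "true"  # inline the transcript when done
}

# ...then, once the job reports done:
transcript = status["data"]["result"]  # no separate download needed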

Sample Output (with timestamps)

Here’s what an actual response looks like – a transcription of a tech podcast episode, returned with timestamps enabled:

[0:00 - 0:04]  So the thing about large language models that nobody talks about
[0:04 - 0:09]  is how much of the cost comes from inference, not training.
[0:09 - 0:14]  Training is a one-time expense. Inference is what you pay every single time a user hits your API.
[0:14 - 0:18]  And that ratio keeps getting worse as adoption grows.

Each segment includes a start and end timestamp in [M:SS - M:SS] format. Whisper auto-detected the language here without any explicit parameter – though setting it manually improves accuracy on short clips or mixed-language audio.
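The exact parameter for forcing a language isn't shown in this guide; if deAPI follows Whisper's own convention, it would look something like this – treat the language field name as an assumption and check the docs:

data={
    "source_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "model": "WhisperLargeV3",
    "language": "en"  # assumed field name, following Whisper's convention
}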

Beyond YouTube: Same Code, Different URL

The /audio/transcriptions endpoint doesn’t care where the video lives. Replace the YouTube URL with any supported platform and the code stays identical:

# Twitch VOD
"source_url": "https://www.twitch.tv/videos/1234567890"

# TikTok
"source_url": "https://www.tiktok.com/@user/video/1234567890"

# X (Twitter) video
"source_url": "https://x.com/user/status/1234567890"

# X Spaces
"source_url": "https://x.com/i/spaces/1AbCdEfGhIjKl"

X Spaces transcription is particularly useful – these live audio sessions aren’t replayable forever, and there’s no native transcript feature. Capturing them as text while they’re available opens up a content source most people ignore.

For local files, swap source_url for source_file and upload directly. Supported formats include AAC, MP3, WAV, FLAC, OGG for audio and MP4, AVI, WMV for video, with a 50 MB size limit.
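A minimal sketch of a local upload, assuming source_file is accepted as a standard multipart file field:

with open("meeting.mp3", "rb") as f:
    response = requests.post(
        f"{BASE}/audio/transcriptions",
        headers=HEADERS,
        files={"source_file": f},  # multipart upload instead of a URL
        data={"model": "WhisperLargeV3", "include_ts": "true"}
    )
request_id = response.json()["data"]["request_id"]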

Conditioning the Transcription with initial_prompt

Whisper Large V3 accepts an initial_prompt parameter – a text snippet that biases how the model interprets ambiguous audio. This is particularly valuable when your content uses specialized terminology that the model might misspell.

A few practical examples:

Tech podcast: Include project names, framework names, and acronyms the speakers use frequently. “Kubernetes” won’t get transcribed as “Cooper Netties” if the model sees it in the prompt.

Non-English content: Write the prompt in the same language as the audio. For a Polish interview, list the names and places that appear: “Wywiad z Janem Kowalskim o projekcie w Krakowie” anchors the model’s Polish diacritics and proper noun handling.

Business calls: Financial acronyms like EBITDA, CAPEX, and ARR are common failure points. Listing them in the prompt with their expected casing fixes most errors.

The prompt is limited to roughly 224 tokens and gets truncated from the front, so put your most important terms near the end.
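Putting that together, a tech-podcast request might look like this – the terminology list is illustrative:

data={
    "source_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "model": "WhisperLargeV3",
    "include_ts": "true",
    # most important terms go last, since the prompt truncates from the front
    "initial_prompt": "A podcast about Kubernetes, Terraform, gRPC, EBITDA, ARR"
}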

Three Things You Can Build With This

Transcription is a building block, not a destination. Here’s where it gets interesting.

YouTube-to-blog-post pipeline. Transcribe a video, feed the text to an LLM with a prompt like “restructure this into a 1,000-word article with headers,” and you have a first draft in minutes. Content creators sitting on hundreds of videos are sitting on hundreds of unwritten blog posts.

Podcast search engine. Transcribe every episode, generate embeddings with deAPI’s text-to-embedding endpoint (BGE M3), and build semantic search across hours of audio. Your users type a question, and they land on the exact 30-second segment where the host discussed it – complete with a timestamp link.
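A minimal sketch of the retrieval half, assuming you've wrapped the embedding endpoint in a hypothetical embed() helper that returns a NumPy vector per string, and that the transcript is already split into (timestamp, text) segments:

import numpy as np

def top_segments(query, segments, segment_vecs, embed, k=3):
    # segments: list of (timestamp, text) tuples from the transcript;
    # segment_vecs: their precomputed embeddings, stacked into an (n, dim) array
    q = embed(query)
    q = q / np.linalg.norm(q)
    mat = segment_vecs / np.linalg.norm(segment_vecs, axis=1, keepdims=True)
    scores = mat @ q  # cosine similarity against every segment
    best = np.argsort(scores)[::-1][:k]
    return [(segments[i][0], segments[i][1], float(scores[i])) for i in best]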

Stream highlight detector. Transcribe Twitch or Kick VODs with timestamps enabled, then scan the text for high-energy moments – audience callouts, reaction phrases, topic shifts. Pair this with a simple keyword scorer and you can auto-generate highlight clips from 8-hour streams without watching a single minute.
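The scorer can be as simple as counting reaction phrases per timestamped segment. A sketch that works on the [M:SS - M:SS] format shown earlier – the phrase list is an obvious assumption to tune per channel:

import re

REACTIONS = ["let's go", "no way", "insane", "clip that", "oh my god"]

def score_segments(transcript_lines):
    # Rank '[M:SS - M:SS]  text' lines by reaction-phrase hits
    scored = []
    for line in transcript_lines:
        m = re.match(r"\[([\d:]+) - ([\d:]+)\]\s+(.*)", line)
        if not m:
            continue
        start, end, text = m.groups()
        hits = sum(text.lower().count(p) for p in REACTIONS)
        if hits:
            scored.append((start, end, hits, text))
    return sorted(scored, key=lambda s: -s[2])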

Wrapping Up

The full loop is three API calls: submit a URL, poll for status, download the transcript. Whisper Large V3 handles 99 languages, timestamps, and noisy audio without any configuration beyond the model slug.

At $0.021 per hour, the free $5 credit covers about 238 hours of transcription. That’s enough to process your entire YouTube channel, podcast archive, or a month of team meeting recordings. For context, OpenAI charges $0.36/hour for the same Whisper Large V3 model – roughly 17× more for identical output.

Get your API key and try it – the docs have additional examples for webhooks, file uploads, and the legacy per-platform endpoints if you need them.
