Day 4: Downloading YouTube Transcripts
What You'll Build
Today you'll learn how to download YouTube transcripts using Google Colab and inject them into your AI advisor. This extends your RAG system with real data from YouTube creators - giving your AI advisor access to knowledge that isn't in its training data.
Why Transcripts?
Large language models have a training cutoff date - often many months to a couple of years in the past. They don't know:
- What Theo said about Node.js last week
- ThePrimeagen's latest thoughts on Vim
- Any content published after their training cutoff
By downloading transcripts and feeding them to our AI, we can give it access to current knowledge from any YouTube creator.
Understanding RAG at a Deeper Level
Before we dive in, let's understand what we're building towards. In production RAG systems:
- Chunking: Large documents are broken into smaller pieces
- Vectorization: Each chunk is converted to a vector (an array of 512-3072 numbers) that captures its meaning
- Vector Database: These vectors are stored in a specialized database (like Pinecone)
- Semantic Search: When a user asks a question, their query is vectorized and compared to stored vectors to find relevant chunks
Today we're doing a simpler version: downloading full transcripts and injecting them when we detect the user is asking about a specific creator.
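The semantic-search step described above can be sketched in plain Python. The "embeddings" here are tiny made-up vectors, not real model output - real embeddings have 512-3072 numbers, as noted above - but the comparison logic (cosine similarity) is the same one vector databases use:

```python
import math

def cosine_similarity(a, b):
    """Compare two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-number "embeddings" standing in for real ones.
chunks = {
    "Theo on Node.js streams": [0.9, 0.1, 0.2],
    "Prime on Vim motions":    [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # pretend this is the embedded user question

# Semantic search = find the stored chunk closest to the query vector.
best = max(chunks, key=lambda name: cosine_similarity(query_vector, chunks[name]))
print(best)  # Theo on Node.js streams
```

This is the whole idea behind "Semantic Search" in the list above: the math finds the closest meaning, not the closest keyword.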
Step 1: Set Up Google Colab
We're using Google Colab because:
- No Python setup required on your machine
- Free to use
- Easy package installation
Go to colab.research.google.com and create a new notebook.
Step 2: Install the YouTube Transcript API
In the first cell, install the required package:
!pip install youtube-transcript-api
Run this cell before proceeding.
Step 3: Download Transcripts
In the second cell, paste this code:
import json
import re
from youtube_transcript_api import YouTubeTranscriptApi

def clean_video_id(url_or_id):
    """Extract video ID from YouTube URL"""
    if 'youtube.com' in url_or_id or 'youtu.be' in url_or_id:
        match = re.search(r'(?:v=|youtu\.be/)([a-zA-Z0-9_-]{11})', url_or_id)
        return match.group(1) if match else url_or_id
    return url_or_id

def download_transcript(video_id, languages=['en']):
    """Download transcript using the correct API"""
    video_id = clean_video_id(video_id)
    try:
        # Create API instance
        api = YouTubeTranscriptApi()
        # Fetch transcript
        transcript = api.fetch(video_id, languages=languages)
        # Get snippets
        snippets = transcript.snippets
        # Convert to dict format
        segments = [
            {"text": s.text, "start": s.start, "duration": s.duration}
            for s in snippets
        ]
        # Combine all text
        full_text = " ".join(s.text for s in snippets)
        return {
            "video_id": video_id,
            "url": f"https://www.youtube.com/watch?v={video_id}",
            "language": transcript.language_code,
            "is_generated": transcript.is_generated,
            "segments": segments,
            "text": full_text,
            "char_count": len(full_text)
        }
    except Exception as e:
        print(f"  Error: {e}")
        return None

# ========== ADD YOUR VIDEO IDs HERE ==========
VIDEO_IDS = [
    "X6AR2RMB5tE",  # Example: ThePrimeagen Vim tutorial
    # Add more video IDs here...
]

# ========== DOWNLOAD ==========
print(f"Downloading {len(VIDEO_IDS)} transcripts...\n")
results = []

for i, vid in enumerate(VIDEO_IDS, 1):
    vid_clean = clean_video_id(vid)
    print(f"[{i}/{len(VIDEO_IDS)}] {vid_clean}")
    result = download_transcript(vid, languages=['en'])
    if result:
        # Save JSON
        with open(f"{result['video_id']}.json", "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)
        # Save TXT
        with open(f"{result['video_id']}.txt", "w", encoding="utf-8") as f:
            f.write(f"Video: {result['url']}\n")
            f.write(f"Language: {result['language']}\n\n")
            f.write(result['text'])
        print(f"  Success: {result['char_count']:,} characters")
        print(f"  Saved: {result['video_id']}.json, {result['video_id']}.txt")
        print(f"  Preview: {result['text'][:100]}...")
        results.append(result)
    print()

print(f"\nComplete: {len(results)}/{len(VIDEO_IDS)} transcripts downloaded")
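If you want to double-check what the script writes to disk, here is a quick round-trip using a made-up record with the same shape `download_transcript()` returns (the video ID and text are placeholders, not real data):

```python
import json

# A made-up record with the same shape download_transcript() returns.
sample = {
    "video_id": "X6AR2RMB5tE",
    "url": "https://www.youtube.com/watch?v=X6AR2RMB5tE",
    "language": "en",
    "is_generated": True,
    "segments": [{"text": "hello world", "start": 0.0, "duration": 1.5}],
    "text": "hello world",
    "char_count": 11,
}

# Write it exactly the way the download loop does...
with open(f"{sample['video_id']}.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2, ensure_ascii=False)

# ...then read it back and confirm nothing was lost.
with open(f"{sample['video_id']}.json", encoding="utf-8") as f:
    loaded = json.load(f)

assert loaded == sample
print("round-trip OK:", loaded["char_count"], "characters")
```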
Step 4: Get Your Video IDs
To get a video ID from YouTube:
- Go to any YouTube video
- Look at the URL:
https://www.youtube.com/watch?v=X6AR2RMB5tE - the video ID is the part after v= (here, X6AR2RMB5tE)
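If you'd rather extract the ID programmatically, the standard library can pull the v= parameter out of a full watch URL (note this handles only watch?v= links, not youtu.be short links - the regex in Step 3 covers both):

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.youtube.com/watch?v=X6AR2RMB5tE"
# parse_qs returns a dict of query parameters, each mapped to a list of values.
video_id = parse_qs(urlparse(url).query)["v"][0]
print(video_id)  # X6AR2RMB5tE
```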
Replace the VIDEO_IDS list with videos from creators you want in your advisor board:
- Theo Brown videos about web development
- ThePrimeagen videos about Vim and coding
- Any other tech YouTubers you follow
Step 5: Download Files from Colab
After running the code:
- Click the folder icon on the left side of Colab
- Find your .txt and .json files
- Right-click and download them
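Downloading files one by one gets tedious with many videos. A sketch of an alternative: bundle everything into a single zip in a new Colab cell, then trigger one browser download with Colab's `files.download` helper (the zip step runs anywhere; the download line is Colab-only, so it's shown commented out here):

```python
import glob
import zipfile

# Bundle every transcript file in the working directory into one archive.
with zipfile.ZipFile("transcripts.zip", "w") as zf:
    for name in glob.glob("*.txt") + glob.glob("*.json"):
        zf.write(name)

print("zipped:", zipfile.ZipFile("transcripts.zip").namelist())

# In Colab, you can then trigger a single browser download:
# from google.colab import files
# files.download("transcripts.zip")
```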
Step 6: Add Transcripts to Your Project
Create a transcripts folder and add your downloaded files:
ai-advisor/
  data/
    knowledge-base.json
    transcripts/
      theo.txt       <-- Add your transcripts here
      primogen.txt
      brian.txt
Rename the files to match the advisor names in your knowledge base.
Step 7: Update Your API Route
Modify app/api/chat/route.ts to inject transcripts when relevant:
import fs from "fs";
import path from "path";
import { NextRequest } from "next/server";
import knowledgeBase from "@/data/knowledge-base.json"; // adjust to match your existing import

// Check if the user is asking about a specific advisor
function getTranscriptForMessage(message: string): string {
  const messageLower = message.toLowerCase();

  // Check for each advisor
  if (messageLower.includes("theo")) {
    const transcriptPath = path.join(process.cwd(), "data/transcripts/theo.txt");
    if (fs.existsSync(transcriptPath)) {
      return fs.readFileSync(transcriptPath, "utf-8");
    }
  }

  if (messageLower.includes("primo") || messageLower.includes("prime")) {
    const transcriptPath = path.join(process.cwd(), "data/transcripts/primogen.txt");
    if (fs.existsSync(transcriptPath)) {
      return fs.readFileSync(transcriptPath, "utf-8");
    }
  }

  // Add more advisors as needed
  return "";
}

export async function POST(request: NextRequest) {
  const { message } = await request.json();

  // Get transcript if relevant
  const transcript = getTranscriptForMessage(message);

  // Build system prompt with transcript if available
  const systemPrompt = `You have access to a knowledge base of coding experts.
The experts are: ${[...new Set(knowledgeBase.map(e => e.advisor))].join(", ")}

${transcript ? `Here is a transcript from the relevant expert:\n${transcript}\n\n` : ""}

Here is the knowledge base:
${JSON.stringify(knowledgeBase)}

If the question is not related to coding, do not try to answer it.`;

  // Continue with API call...
}
The Context Window Problem
You can't just dump unlimited text into an LLM. Every model has a context window - a limit on how much text it can process at once.
For our simple implementation:
- Only inject transcripts when relevant (user mentions the advisor)
- Keep transcripts reasonably sized
- Don't try to include all transcripts at once
In production, you'd use a vector database to only retrieve the most relevant chunks.
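One cheap way to "keep transcripts reasonably sized" is a character budget. The 4-characters-per-token ratio below is a rough rule of thumb (real tokenizers vary), and the function names are my own, not part of any library:

```python
def truncate_to_budget(text, max_tokens=8000, chars_per_token=4):
    """Crudely cap a transcript so it fits inside a context window.

    Assumes ~4 characters per token -- a rough rule of thumb, not exact.
    """
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    # Cut at the last space before the limit so we don't split a word.
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut != -1 else max_chars]

long_transcript = "word " * 20000  # ~100,000 characters
trimmed = truncate_to_budget(long_transcript, max_tokens=8000)
print(len(trimmed))  # at most 32,000 characters
```

You'd call this on the transcript before interpolating it into the system prompt.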
Test It!
Try asking questions like:
- "What does Theo think about Node.js?"
- "What's Prime's opinion on Vim?"
- "Tell me about Brian's views on testing"
The AI should now respond using the actual content from those YouTube videos!
Key Takeaways
- Transcripts are data: YouTube videos are a goldmine of expert knowledge
- RAG with real data: We're now using actual expert content, not just static JSON
- Context window matters: We can't dump everything in - be selective
- Information architecture: How you structure and name your data matters for retrieval
Challenge: Extend What You Built
You've got real data now. Time to make it yours:
- Build YOUR advisory board: Download transcripts from 3-5 creators you actually follow. Tech, business, fitness, cooking - doesn't matter. Make this useful to YOU.
- Handle multiple transcripts per creator: What if Theo has 10 videos you care about? How do you decide which one to inject? Build logic to pick the most relevant one.
- Add topic-based routing: Instead of just matching "Theo" or "Prime", detect the topic ("What does anyone think about testing?") and pull relevant content from multiple creators.
- Try chunking: Long transcripts are a problem. Can you split them into smaller chunks and only inject the relevant parts? This is what vector databases do.
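As a starting point for the chunking challenge, here's a minimal sketch: fixed-size character chunks with overlap, plus a deliberately naive word-overlap scorer standing in for the vector search a real system would use (all names and the sample text are made up):

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping character chunks.

    Overlap keeps sentences that straddle a boundary visible in both
    neighbouring chunks -- the same trick production RAG pipelines use.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

def pick_relevant(chunks, query):
    """Naive retrieval: score chunks by words shared with the query."""
    query_words = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_words & set(c.lower().split())))

transcript = "Vim motions save time. " * 200 + "Testing matters a lot. " * 200
chunks = chunk_text(transcript)
best = pick_relevant(chunks, "what does he think about testing")
print("testing" in best.lower())  # True
```

Swap `pick_relevant` for real embeddings and you have the core of a vector-database pipeline.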
Tomorrow's the last day. Come with questions and ideas for how to take this further.
What's Next
Tomorrow, we'll wrap up the course by looking at a real AI product I built - a TikTok influencer finder that uses these exact same concepts at scale. You'll see how RAG and AI engineering work in production, and I'll share what's next for your AI journey.