Day 4: Downloading YouTube Transcripts
What You'll Build
Today you'll learn how to download YouTube transcripts using Google Colab and inject them into your AI advisor. This extends your RAG system with real data from YouTube creators - giving your AI advisor access to knowledge that isn't in its training data.
Why Transcripts?
Large language models have a training cutoff date - often many months to a couple of years in the past. They don't know:
- What Theo said about Node.js last week
- ThePrimeagen's latest thoughts on Vim
- Any content published after their training cutoff
By downloading transcripts and feeding them to our AI, we can give it access to current knowledge from any YouTube creator.
Understanding RAG at a Deeper Level
Before we dive in, let's understand what we're building towards. In production RAG systems:
- Chunking: Large documents are broken into smaller pieces
- Vectorization: Each chunk is converted to a vector (an array of 512-3072 numbers) that captures its meaning
- Vector Database: These vectors are stored in a specialized database (like Pinecone)
- Semantic Search: When a user asks a question, their query is vectorized and compared to stored vectors to find relevant chunks
Today we're doing a simpler version: downloading full transcripts and injecting them when we detect the user is asking about a specific creator.
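The semantic-search step described above can be sketched in plain Python. The "embeddings" here are tiny made-up vectors, not real model output - real embeddings have 512-3072 numbers, as noted above - but the comparison logic (cosine similarity) is the same one vector databases use:

```python
import math

def cosine_similarity(a, b):
    """Compare two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-number "embeddings" standing in for real ones.
chunks = {
    "Theo on Node.js streams": [0.9, 0.1, 0.2],
    "Prime on Vim motions":    [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # pretend this is the embedded user question

# Semantic search = find the stored chunk closest to the query vector.
best = max(chunks, key=lambda name: cosine_similarity(query_vector, chunks[name]))
print(best)  # Theo on Node.js streams
```

This is the whole idea behind "Semantic Search" in the list above: the math finds the closest meaning, not the closest keyword.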
Step 1: Set Up Google Colab
We're using Google Colab because:
- No Python setup required on your machine
- Free to use
- Easy package installation
Go to colab.research.google.com and create a new notebook.
Step 2: Install the YouTube Transcript API
In the first cell, install the required package:
!pip install youtube-transcript-api
Run this cell before proceeding.
Step 3: Download Transcripts
In the second cell, paste this code:
import json
import re
from youtube_transcript_api import YouTubeTranscriptApi

def clean_video_id(url_or_id):
    """Extract video ID from YouTube URL"""
    if 'youtube.com' in url_or_id or 'youtu.be' in url_or_id:
        match = re.search(r'(?:v=|youtu\.be/)([a-zA-Z0-9_-]{11})', url_or_id)
        return match.group(1) if match else url_or_id
    return url_or_id

def download_transcript(video_id, languages=['en']):
    """Download transcript using the correct API"""
    video_id = clean_video_id(video_id)
    try:
        # Create API instance
        api = YouTubeTranscriptApi()
        # Fetch transcript
        transcript = api.fetch(video_id, languages=languages)
        # Get snippets
        snippets = transcript.snippets
        # Convert to dict format
        segments = [
            {"text": s.text, "start": s.start, "duration": s.duration}
            for s in snippets
        ]
        # Combine all text
        full_text = " ".join(s.text for s in snippets)
        return {
            "video_id": video_id,
            "url": f"https://www.youtube.com/watch?v={video_id}",
            "language": transcript.language_code,
            "is_generated": transcript.is_generated,
            "segments": segments,
            "text": full_text,
            "char_count": len(full_text)
        }
    except Exception as e:
        print(f"  Error: {e}")
        return None

# ========== ADD YOUR VIDEO IDs HERE ==========
VIDEO_IDS = [
    "X6AR2RMB5tE",  # Example: ThePrimeagen Vim tutorial
    # Add more video IDs here...
]

# ========== DOWNLOAD ==========
print(f"Downloading {len(VIDEO_IDS)} transcripts...\n")
results = []

for i, vid in enumerate(VIDEO_IDS, 1):
    vid_clean = clean_video_id(vid)
    print(f"[{i}/{len(VIDEO_IDS)}] {vid_clean}")
    result = download_transcript(vid, languages=['en'])
    if result:
        # Save JSON
        with open(f"{result['video_id']}.json", "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)
        # Save TXT
        with open(f"{result['video_id']}.txt", "w", encoding="utf-8") as f:
            f.write(f"Video: {result['url']}\n")
            f.write(f"Language: {result['language']}\n\n")
            f.write(result['text'])
        print(f"  Success: {result['char_count']:,} characters")
        print(f"  Saved: {result['video_id']}.json, {result['video_id']}.txt")
        print(f"  Preview: {result['text'][:100]}...")
        results.append(result)
    print()

print(f"\nComplete: {len(results)}/{len(VIDEO_IDS)} transcripts downloaded")
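If you want to double-check what the script writes to disk, here is a quick round-trip using a made-up record with the same shape `download_transcript()` returns (the video ID and text are placeholders, not real data):

```python
import json

# A made-up record with the same shape download_transcript() returns.
sample = {
    "video_id": "X6AR2RMB5tE",
    "url": "https://www.youtube.com/watch?v=X6AR2RMB5tE",
    "language": "en",
    "is_generated": True,
    "segments": [{"text": "hello world", "start": 0.0, "duration": 1.5}],
    "text": "hello world",
    "char_count": 11,
}

# Write it exactly the way the download loop does...
with open(f"{sample['video_id']}.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2, ensure_ascii=False)

# ...then read it back and confirm nothing was lost.
with open(f"{sample['video_id']}.json", encoding="utf-8") as f:
    loaded = json.load(f)

assert loaded == sample
print("round-trip OK:", loaded["char_count"], "characters")
```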
Step 4: Get Your Video IDs
To get a video ID from YouTube:
- Go to any YouTube video
- Look at the URL:
https://www.youtube.com/watch?v=X6AR2RMB5tE - the video ID is the part after v= (here, X6AR2RMB5tE)
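If you'd rather extract the ID programmatically, the standard library can pull the v= parameter out of a full watch URL (note this handles only watch?v= links, not youtu.be short links - the regex in Step 3 covers both):

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.youtube.com/watch?v=X6AR2RMB5tE"
# parse_qs returns a dict of query parameters, each mapped to a list of values.
video_id = parse_qs(urlparse(url).query)["v"][0]
print(video_id)  # X6AR2RMB5tE
```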
Replace the VIDEO_IDS list with videos from creators you want in your advisor board:
- Theo Brown videos about web development
- ThePrimeagen videos about Vim and coding
- Any other tech YouTubers you follow
Step 5: Download Files from Colab
After running the code:
- Click the folder icon on the left side of Colab
- Find your .txt and .json files
- Right-click and download them
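Downloading files one by one gets tedious with many videos. A sketch of an alternative: bundle everything into a single zip in a new Colab cell, then trigger one browser download with Colab's `files.download` helper (the zip step runs anywhere; the download line is Colab-only, so it's shown commented out here):

```python
import glob
import zipfile

# Bundle every transcript file in the working directory into one archive.
with zipfile.ZipFile("transcripts.zip", "w") as zf:
    for name in glob.glob("*.txt") + glob.glob("*.json"):
        zf.write(name)

print("zipped:", zipfile.ZipFile("transcripts.zip").namelist())

# In Colab, you can then trigger a single browser download:
# from google.colab import files
# files.download("transcripts.zip")
```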
Step 6: Add Transcripts to Your Project
Create a transcripts folder and add your downloaded files:
ai-advisor/
  data/
    knowledge-base.json
    transcripts/
      theo.txt       <-- Add your transcripts here
      primogen.txt
      brian.txt
Rename the files to match the advisor names in your knowledge base.
Step 7: Update Your API Route
Modify app/api/chat/route.ts to inject transcripts when relevant:
import fs from "fs";
import path from "path";
import { NextRequest } from "next/server";
import knowledgeBase from "@/data/knowledge-base.json"; // adjust to match your existing import

// Check if the user is asking about a specific advisor
function getTranscriptForMessage(message: string): string {
  const messageLower = message.toLowerCase();

  // Check for each advisor
  if (messageLower.includes("theo")) {
    const transcriptPath = path.join(process.cwd(), "data/transcripts/theo.txt");
    if (fs.existsSync(transcriptPath)) {
      return fs.readFileSync(transcriptPath, "utf-8");
    }
  }

  if (messageLower.includes("primo") || messageLower.includes("prime")) {
    const transcriptPath = path.join(process.cwd(), "data/transcripts/primogen.txt");
    if (fs.existsSync(transcriptPath)) {
      return fs.readFileSync(transcriptPath, "utf-8");
    }
  }

  // Add more advisors as needed
  return "";
}

export async function POST(request: NextRequest) {
  const { message } = await request.json();

  // Get transcript if relevant
  const transcript = getTranscriptForMessage(message);

  // Build system prompt with transcript if available
  const systemPrompt = `You have access to a knowledge base of coding experts.
The experts are: ${[...new Set(knowledgeBase.map(e => e.advisor))].join(", ")}

${transcript ? `Here is a transcript from the relevant expert:\n${transcript}\n\n` : ""}

Here is the knowledge base:
${JSON.stringify(knowledgeBase)}

If the question is not related to coding, do not try to answer it.`;

  // Continue with API call...
}
The Context Window Problem
You can't just dump unlimited text into an LLM. Every model has a context window - a limit on how much text it can process at once.
For our simple implementation:
- Only inject transcripts when relevant (user mentions the advisor)
- Keep transcripts reasonably sized
- Don't try to include all transcripts at once
In production, you'd use a vector database to only retrieve the most relevant chunks.
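One cheap way to "keep transcripts reasonably sized" is a character budget. The 4-characters-per-token ratio below is a rough rule of thumb (real tokenizers vary), and the function names are my own, not part of any library:

```python
def truncate_to_budget(text, max_tokens=8000, chars_per_token=4):
    """Crudely cap a transcript so it fits inside a context window.

    Assumes ~4 characters per token -- a rough rule of thumb, not exact.
    """
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    # Cut at the last space before the limit so we don't split a word.
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut != -1 else max_chars]

long_transcript = "word " * 20000  # ~100,000 characters
trimmed = truncate_to_budget(long_transcript, max_tokens=8000)
print(len(trimmed))  # at most 32,000 characters
```

You'd call this on the transcript before interpolating it into the system prompt.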
Test It!
Try asking questions like:
- "What does Theo think about Node.js?"
- "What's Prime's opinion on Vim?"
- "Tell me about Brian's views on testing"
The AI should now respond using the actual content from those YouTube videos!
Key Takeaways
- Transcripts are data: YouTube videos are a goldmine of expert knowledge
- RAG with real data: We're now using actual expert content, not just static JSON
- Context window matters: We can't dump everything in - be selective
- Information architecture: How you structure and name your data matters for retrieval
Challenge: Extend What You Built
You've got real data now. Time to make it yours:
- Build YOUR advisory board: Download transcripts from 3-5 creators you actually follow. Tech, business, fitness, cooking - doesn't matter. Make this useful to YOU.
- Handle multiple transcripts per creator: What if Theo has 10 videos you care about? How do you decide which one to inject? Build logic to pick the most relevant one.
- Add topic-based routing: Instead of just matching "Theo" or "Prime", detect the topic ("What does anyone think about testing?") and pull relevant content from multiple creators.
- Try chunking: Long transcripts are a problem. Can you split them into smaller chunks and only inject the relevant parts? This is what vector databases do.
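As a starting point for the chunking challenge, here's a minimal sketch: fixed-size character chunks with overlap, plus a deliberately naive word-overlap scorer standing in for the vector search a real system would use (all names and the sample text are made up):

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping character chunks.

    Overlap keeps sentences that straddle a boundary visible in both
    neighbouring chunks -- the same trick production RAG pipelines use.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

def pick_relevant(chunks, query):
    """Naive retrieval: score chunks by words shared with the query."""
    query_words = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_words & set(c.lower().split())))

transcript = "Vim motions save time. " * 200 + "Testing matters a lot. " * 200
chunks = chunk_text(transcript)
best = pick_relevant(chunks, "what does he think about testing")
print("testing" in best.lower())  # True
```

Swap `pick_relevant` for real embeddings and you have the core of a vector-database pipeline.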
Tomorrow's the last day. Come with questions and ideas for how to take this further.
What's Next
Tomorrow, we'll wrap up the course by looking at a real AI product I built - a TikTok influencer finder that uses these exact same concepts at scale. You'll see how RAG and AI engineering work in production, and I'll share what's next for your AI journey.