Give Claude Code Eyes: The /watch Skill That Lets Your Agent Watch Any Video
A Claude Code skill that downloads any video, extracts frames, transcribes audio, and hands everything to Claude — so it can answer questions about what it actually saw and heard, not what the title says.
Last updated: June 15, 2026
The gap Claude Code couldn't close
Claude Code can read a webpage, run a script, browse a repo, analyze an image. What it can't do natively is watch a video. Paste a YouTube link and it has to guess from the title, pull a transcript that misses half of what's on screen, or admit it can't help.
The `/watch` skill closes that gap.
Installing the skill
```bash
# Claude Code (CLI)
/plugin marketplace add bradautomates/claude-video
/plugin install watch@claude-video

# Codex / generic skills runtime
git clone https://github.com/bradautomates/claude-video.git ~/.codex/skills/watch
```
For claude.ai (web): download `watch.skill`
How to use it
- Analyze a video: `/watch https://youtu.be/dQw4w9WgXcQ what happens at the 30 second mark?`
- Reverse-engineer content: `/watch https://youtu.be/abc123 what hook did they open with?`
- Debug from a recording: `/watch bug-repro.mov what's going wrong?`
- Summarize: `/watch https://youtu.be/xyz summarize this`
- Focused window: `/watch https://youtu.be/abc --start 0:45 --end 1:15 what changed on screen?`
What the skill actually does
When you call `/watch`:
- `yt-dlp` downloads the video into a temp directory (local files are probed in place)
- `ffmpeg` extracts frames at a duration-aware rate
- The transcript comes from native captions via `yt-dlp` (free), with the Whisper API as a fallback
- Frames and transcript are handed to Claude, which reads each frame as an image in parallel
- Claude answers grounded in what's on screen and in the audio
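The steps above can be sketched as a few command builders. This is a minimal illustration, not the skill's actual code: the function names, the output file names, and the exact combination of flags are assumptions. The `yt-dlp` and `ffmpeg` options themselves are standard ones.

```python
from pathlib import Path


def download_cmd(url: str, workdir: Path) -> list[str]:
    """Build a yt-dlp command that fetches the video plus English captions."""
    return [
        "yt-dlp",
        "--write-subs",       # uploader-provided captions, if any
        "--write-auto-subs",  # auto-generated captions as a second choice
        "--sub-langs", "en",
        "-o", str(workdir / "video.%(ext)s"),  # hypothetical output name
        url,
    ]


def frames_cmd(video: Path, workdir: Path, fps: float) -> list[str]:
    """Build an ffmpeg command that samples frames into numbered JPEGs."""
    return [
        "ffmpeg",
        "-i", str(video),
        "-vf", f"fps={fps:g}",            # fps filter: frames kept per second
        str(workdir / "frame_%04d.jpg"),  # hypothetical frame naming
    ]


def transcript_source(caption_files: list[Path]) -> str:
    """Prefer downloaded captions; fall back to Whisper only when none exist."""
    return "captions" if caption_files else "whisper"
```

Each builder returns an argument list that could be passed to `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes the captions-first fallback easy to check without touching the network.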
Zero config on first run: `yt-dlp` and `ffmpeg` are installed via `brew`.
The frame budget: why it matters for token cost
Every frame is sent as an image, and every image costs tokens. The skill's auto-fps logic exists to avoid blowing your context budget on a long video:
- ≤30 seconds: ~30 frames — dense, essentially every key moment
- 30s–1 min: ~40 frames — still dense
- 1–3 min: ~60 frames — comfortable
- 3–10 min: ~80 frames — sparse but workable
- >10 min: 100 frames, so use `--start` / `--end` for focused questions
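To make the trade-off concrete, here is a rough sketch of how a duration-aware budget like the one listed above could be turned into a sampling rate. The tier values come from the list; the function names, the inclusive upper bounds, and the even spacing of frames are assumptions, not the skill's confirmed logic.

```python
# (max duration in seconds, frame budget), taken from the tiers listed above.
# Treating each upper bound as inclusive is an assumption.
TIERS = [(30, 30), (60, 40), (180, 60), (600, 80)]
MAX_FRAMES = 100  # anything longer than 10 minutes


def frame_budget(duration_s: float) -> int:
    """Return the approximate number of frames for a clip of this length."""
    for limit, frames in TIERS:
        if duration_s <= limit:
            return frames
    return MAX_FRAMES


def sampling_fps(duration_s: float) -> float:
    """Frames per second needed to spread the budget evenly over the clip."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return frame_budget(duration_s) / duration_s


def seconds_between_frames(duration_s: float) -> float:
    """Gap between consecutive sampled frames, in seconds."""
    return 1.0 / sampling_fps(duration_s)
```

Under these assumptions a 20-minute video gets one frame about every 12 seconds, while a 30-second clip gets about one frame per second. That gap is why narrowing the window helps when the question is about a specific moment.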
What people actually use it for
- Content analysis — break down a competitor's hook structure, pacing, and CTA from the actual video, not the title
- Bug reproduction — watch a screen recording of something broken and identify the frame where the issue appears
- Video summarization — structure and key moments faster than 2x speed
- Ad creative analysis — paste a TikTok or Reel URL and ask what made the first 3 seconds work
Frequently asked questions
What video sources does the skill support? Anything yt-dlp supports: YouTube, Loom, TikTok, X, Instagram, Vimeo, and hundreds of other platforms. Local `.mp4`, `.mov`, `.mkv`, and `.webm` files work too.
Do I need a Whisper API key? No, not for most videos. The skill first tries native captions via yt-dlp, which is free and instant. Whisper only kicks in for videos with no captions.
How do I get more detail on a specific moment in a long video? Use `--start` and `--end`: `/watch https://youtu.be/abc --start 2:30 --end 3:00 what's happening here?`
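For illustration, the window arithmetic behind `--start` and `--end` might look like the sketch below. The helper names are hypothetical, and accepting `H:MM:SS` alongside the `M:SS` form used in the examples is an assumption about what the skill parses.

```python
def parse_timestamp(text: str) -> int:
    """Convert 'SS', 'M:SS', or 'H:MM:SS' into whole seconds."""
    parts = text.strip().split(":")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognized timestamp: {text!r}")
    seconds = 0
    for part in parts:
        # Each field to the left is worth 60x the one after it.
        seconds = seconds * 60 + int(part)
    return seconds


def window_seconds(start: str, end: str) -> tuple[int, int, int]:
    """Return (start, end, length) in seconds for a --start/--end pair."""
    begin, finish = parse_timestamp(start), parse_timestamp(end)
    if finish <= begin:
        raise ValueError("--end must come after --start")
    return begin, finish, finish - begin
```

With the example above, `window_seconds("2:30", "3:00")` describes a 30-second window. If the frame budget is applied to the window rather than the whole video, which is an assumption here, that window falls in the densest tier.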