Over-Engineering a Rocket League Montage Generator with Go and OCR
VIEW THE RESULT HERE: VIDEO
The Problem
The game I play the most is Rocket League. Sometimes, when I'm playing, I do something cool and I clip it.
"Clipping it" in this context means I use the Xbox Game Bar on my PC to save the last 30 seconds of footage. The clip then gets saved to a specific folder on my PC.
The ultimate goal of clipping it is to share it. Here is the issue though:
- Only ~12 seconds of that clip is relevant (the setup + the goal).
- No one has the attention span for 30-second clips (myself included).
- I'd need to manually trim the clip to get it to an acceptable length.
- I've been playing Rocket League for years, but I've never gone back and trimmed them manually.
- I now have more than 300 un-trimmed, 30-second videos sitting on my hard drive.
The Solution
I decided that I wanted to automate trimming these 30-second clips down to the relevant ~11-12 seconds. There was no chance I'd ever go through all 300+ clips myself. More importantly, I wanted to automate creating a montage of these trimmed clips, complete with transitions and music, without ever opening video editing software. I don't know how to edit videos, but I do know how to code.
I decided to over-engineer a solution using Go, OCR, and FFmpeg.
Clip Trimming: The Goal Detector
The first challenge was automatically finding where the goal happened in the video file.
I realized that Rocket League is consistent: when a goal is scored, the text [PLAYERNAME] SCORED always appears in the exact center of the screen. If I could read that text programmatically, I would know the exact timestamp of the goal.
Why Apple Vision?
My first thought was to use Tesseract (an open-source OCR engine), but it’s notoriously slow and heavy to bundle. Since I’m developing on a Mac, I realized I could tap into the native Apple Vision framework (the same tech that lets you copy text from photos on your iPhone). It is blazingly fast and highly accurate.
To accomplish this, my Go program does the following:
- Frame extraction: It uses `ffmpeg` to extract one frame every second (1 fps) from the video.
- Cropping: Instead of scanning the whole image, it crops a small region in the center where the text appears. This reduces noise and speeds up processing.
- OCR bridge: I wrote a small Python bridge to pass these frames from Go to Apple's `VNRecognizeTextRequest`.
- Fuzzy matching: OCR isn't always perfect (e.g., it might read "SC0RED" or "SCOR ED"), so I implemented a Levenshtein distance check to fuzzy match the word "SCORED."
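The fuzzy match boils down to a standard Levenshtein distance check over each OCR'd word. Here's a simplified sketch of the idea (not the exact code from the project; the threshold of 2 is an assumption, chosen so that both "SC0RED" and the split token "SCOR" still match):

```go
package main

import (
	"fmt"
	"strings"
)

// levenshtein returns the edit distance between a and b using the
// classic two-row dynamic programming approach.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func minInt(x, y int) int {
	if x < y {
		return x
	}
	return y
}

// looksLikeScored reports whether any word in the OCR output is within
// edit distance 2 of "SCORED" (the threshold is an assumption).
func looksLikeScored(ocrText string) bool {
	for _, w := range strings.Fields(strings.ToUpper(ocrText)) {
		if levenshtein(w, "SCORED") <= 2 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksLikeScored("CoolGuy SC0RED")) // true
}
```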
Once a match is found at timestamp $T$, the program tells FFmpeg to slice the video from $T - 11s$ to $T$.
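With the timestamp in hand, the trim is just an `ffmpeg` invocation from Go. A simplified sketch of the argument construction (the 11-second window comes from above; the `-c copy` stream-copy flag is an assumption, traded off in a comment below):

```go
package main

import (
	"fmt"
	"os/exec"
)

// buildTrimArgs builds ffmpeg arguments to slice [goalSec-11, goalSec].
func buildTrimArgs(in, out string, goalSec float64) []string {
	start := goalSec - 11
	if start < 0 {
		start = 0 // goal happened within the first 11 seconds
	}
	return []string{
		"-ss", fmt.Sprintf("%.2f", start), // seek to T - 11s
		"-i", in,
		"-t", "11", // keep 11 seconds
		// Stream copy is fast but snaps to keyframes; drop it for
		// frame-accurate (re-encoded) cuts.
		"-c", "copy",
		"-y", out,
	}
}

// trimClip shells out to ffmpeg (assumed to be on PATH).
func trimClip(in, out string, goalSec float64) error {
	return exec.Command("ffmpeg", buildTrimArgs(in, out, goalSec)...).Run()
}

func main() {
	fmt.Println(buildTrimArgs("clip.mp4", "trimmed.mp4", 18.0))
}
```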
The Montage: Fighting FFmpeg
After trimming the clips, I wanted to merge them into a single highlight reel. This turned out to be the hardest part.
Attempt 1: The "Concat" Demuxer
My first attempt used FFmpeg's concat demuxer. It was instantaneous, but the result was jarring: just "Hard Cut" -> "Hard Cut" -> "Hard Cut." It felt less like a montage and more like a slideshow of chaos.
Attempt 2: Complex Filters & Transitions
I decided to use FFmpeg's xfade (video transition) and acrossfade (audio transition) filters to add a "Wipe Left" effect between every clip.
This introduced a new problem: argument list explosion.
Trying to chain 300 clips into a single FFmpeg command produced a filter graph string so massive that the command failed with Argument list too long.
The Final Solution: Recursive Batching
To solve this, I wrote a recursive batching system in Go:
- Chunking: The program splits the 300 clips into batches of 15.
- Rendering: It renders mini-montages of those 15 clips using hardware acceleration (`h264_videotoolbox` on Mac), which sped up encoding from ~45 minutes to ~3 minutes.
- Stitching: Finally, it stitches the pre-rendered batches together into the final video.
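The batching itself is only a few lines of Go. A simplified sketch, with a hypothetical `renderBatch` callback standing in for the actual FFmpeg invocation:

```go
package main

import "fmt"

// chunk splits clips into batches of at most size elements.
func chunk(clips []string, size int) [][]string {
	var batches [][]string
	for len(clips) > size {
		batches = append(batches, clips[:size:size])
		clips = clips[size:]
	}
	if len(clips) > 0 {
		batches = append(batches, clips)
	}
	return batches
}

// renderAll renders each batch to a mini-montage, then recurses on the
// intermediate files until a single video remains. renderBatch would
// shell out to ffmpeg with the xfade graph for one batch.
func renderAll(clips []string, size int, renderBatch func([]string) string) string {
	for len(clips) > 1 {
		var next []string
		for _, b := range chunk(clips, size) {
			next = append(next, renderBatch(b))
		}
		clips = next
	}
	return clips[0]
}

func main() {
	batches := chunk([]string{"a.mp4", "b.mp4", "c.mp4", "d.mp4"}, 3)
	fmt.Println(len(batches)) // 2 batches: [a b c] and [d]
}
```

Because `renderAll` loops until one file remains, it also handles the case where the stitched batch outputs are themselves too numerous for a single command.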
The Finishing Touches
A montage isn't complete without music and a place to watch it.
- Audio mixing: I wrote a function that scans a directory for music, shuffles the playlist, and uses FFmpeg's `acrossfade` filter to mix the songs into a continuous radio station, fading them out exactly when the video ends.
- The web player: I uploaded the result to a DigitalOcean Space and built a custom HTML5 player. I used HLS (HTTP Live Streaming) to ensure it loads instantly on mobile, and styled the UI with CSS to match the neon, cyberpunk aesthetic of Rocket League.
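The playlist half of the audio step is straightforward in Go. A simplified sketch, assuming an `*.mp3` glob and a hypothetical `songsNeeded` helper that decides which song the fade-out lands on:

```go
package main

import (
	"fmt"
	"math/rand"
	"path/filepath"
)

// shuffledPlaylist globs music files in dir and shuffles them.
// The "*.mp3" pattern is an assumption.
func shuffledPlaylist(dir string) ([]string, error) {
	songs, err := filepath.Glob(filepath.Join(dir, "*.mp3"))
	if err != nil {
		return nil, err
	}
	rand.Shuffle(len(songs), func(i, j int) {
		songs[i], songs[j] = songs[j], songs[i]
	})
	return songs, nil
}

// songsNeeded returns how many songs from durs (durations in seconds,
// in playlist order) are required to cover videoDur seconds of
// montage; the last one is the track that gets faded out.
func songsNeeded(durs []float64, videoDur float64) int {
	total := 0.0
	for i, d := range durs {
		total += d
		if total >= videoDur {
			return i + 1
		}
	}
	return len(durs) // playlist shorter than the video: use everything
}

func main() {
	songs, err := shuffledPlaylist("music")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(songs), "songs; need",
		songsNeeded([]float64{180, 200, 240}, 350), "to cover the video")
}
```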
Conclusion
Was it worth spending 8 hours writing code to avoid 2 hours of video editing?
Absolutely. Not only did I save my clips from digital purgatory, but I now have a pipeline that can generate a new season highlight reel every week while I sleep.
VIEW THE RESULT HERE: VIDEO