Admitting what you know + Gorge_mini
- Jim Clover

- Sep 16
- 11 min read
“…but how am I doing this?”

It’s a question I’ve been asking myself a LOT lately. Call it imposter syndrome but without the negative aspects of that (I’m happy in my hacky-code, AI-enabled software skin and would never claim to be a software developer by trade). How is it that I’m building all of these applications using AI, securing the code, reviewing it by eye to debloat and catch AI 'slop' injections, and getting stuff to production state? I know it's 2025 and AI and all that, but...?
What triggered this question was playing with my recent project, which you can read more about on the Varadius blog https://www.varadius.com/post/project-gorge-using-offline-ai-to-analyse-video-files. In simple terms, Gorge takes an offline computer with a higher-end retail graphics card (in this case, an Nvidia RTX 5090 with 32GB of VRAM) and a video file, and processes it to create a text-file narrative of what's inside said video clip. The goal was to use Augment Code for AI-assisted software development and create both the backend processor and the user interface in 6 hours, including the UI and the ability to upload, store and process videos.
And I did that. Cool. Cup of tea and medals for me. Great.
But how am I doing this, in these timeframes, and already getting interest from folks wanting to know how the processing works?
It’s not all down to Augment Code being awesome and taking the strain, saving days and days of reading, coding look-ups, trial and error and so on. It's a tool, bottom line, and it's great. Something else is going on here that’s allowing me to power through prompts and challenge Augment Code (ultimately Claude Sonnet 4 at the far end) when it goes off the rails or doesn’t quite “get it”. What is that?
It’s experience. This isn't a boast, it's a fact of being 52 and having spent far too long loving technology, creating operational software for all sorts of environments and tasks, running technical teams in hardware and software and now for the last ten years working across a wide variety of commercial and non-commercial sectors as an advisor. The advent of AI basically gave me more tools to execute more things, super fast, with that experience and to coach others on how to do the same. Every day is a schoolday and it's brilliant.
Example: I asked Augment Code to create a hardware detector for a new version of Gorge, gorge_mini. This is a command line non-UI version for bulk processing (give it a thousand video files and off it goes is the plan) and I want it to be as friendly and easy to use as possible. A hardware detection phase is thus needed so that gorge_mini knows if it’s going to use Nvidia, Apple Silicon or your smart fridge for processing. What VRAM is available? Or is this unified RAM on Apple? Can it run at all? And so on. It also needs to shuttle its results to the processing phase, so that calculations can take place (more on this later) to ensure everything fits.
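To give a flavour of what that detection phase has to do, here's a rough Python sketch of my own - not the actual gorge_mini code; the PyTorch/psutil approach and the names are illustrative assumptions only:

```python
# Rough illustrative sketch of a hardware detection pass (not the gorge_mini code).
# Assumes PyTorch and psutil are installed; names here are made up for illustration.
from dataclasses import dataclass
import psutil
import torch

@dataclass
class HardwareProfile:
    backend: str            # "cuda" (Nvidia), "mps" (Apple Silicon) or "cpu"
    total_memory_gb: float  # dedicated VRAM, or unified/system RAM as a fallback

def detect_hardware() -> HardwareProfile:
    if torch.cuda.is_available():
        # Dedicated VRAM on an Nvidia card (e.g. ~32GB on an RTX 5090)
        vram = torch.cuda.get_device_properties(0).total_memory
        return HardwareProfile("cuda", vram / 1024**3)
    if torch.backends.mps.is_available():
        # Apple Silicon shares unified RAM between CPU and GPU,
        # so total system RAM is the closest figure we can report
        return HardwareProfile("mps", psutil.virtual_memory().total / 1024**3)
    # No GPU found: fall back to CPU-only processing
    return HardwareProfile("cpu", psutil.virtual_memory().total / 1024**3)

if __name__ == "__main__":
    profile = detect_hardware()
    print(f"Backend: {profile.backend}, memory: {profile.total_memory_gb:.1f} GB")
```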
Augment Code finished the hardware detector end to end (apparently), I ran it, and whilst it seemed to work, the text output describing the video (which by now I’d watched 50-ish times in trials, so I knew what to expect in the text) was “OK”. It wasn’t good enough for me, however, and in turn I wasn’t going to shove this in front of a potential user/customer. I checked my prompt to the AI (the prompt instructing the LLM vision model on what to do with the video at hand) and tightened it up a bit, but still…not happy. Something else was going on, holding back the quality of the text output…
A word then popped into my head - SAMPLING. Wait, what? Why am I thinking sampling? OF COURSE! I knew enough about offline LLMs and video processing from back in the day, when I would have to review hundreds of clips, to know that it would be extracting frames from the video - like screenshots taken every so many frames - to then analyse and formulate a story of what the video overall contains. So it relies on a reasonable volume of frames to achieve the overall assessment. For a short Ring doorbell clip, how many frames were we extracting? I asked Augment Code this question.
12.
Say what? I mean, that can’t be enough, can it? There are MANY more frames than this even in a short video clip, obviously! How can you assess what a video contains from such a small batch of frames? Back into Augment Code, I asked how 12 came to be the hardcoded maximum number of frames, when the hardware detector was supposed to ALSO share its findings with follow-on functions to shape performance and usage (the number of frame extractions permitted, based on total VRAM). I knew that the calculation was already taking place:
Total VRAM available - Vision LLM load = VRAM available for frame loading (which works out as roughly 21GB of VRAM for the LLM, some overhead, and the rest free, in GB, for frames).
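To put rough numbers on that (the per-frame cost below is my own assumed figure, purely to illustrate the arithmetic, not a value from gorge_mini):

```python
# Illustrative arithmetic only; FRAME_COST_GB is an assumption, not a gorge_mini figure.
TOTAL_VRAM_GB = 32.0   # RTX 5090
VISION_LLM_GB = 21.0   # resident vision LLM, as above
OVERHEAD_GB   = 2.0    # assumed CUDA/context overhead
FRAME_COST_GB = 0.06   # assumed VRAM per decoded full-resolution frame

available_gb = TOTAL_VRAM_GB - VISION_LLM_GB - OVERHEAD_GB
max_frames = int(available_gb / FRAME_COST_GB)
print(f"{available_gb:.1f} GB free for frames -> roughly {max_frames} frames, not 12")
```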
But 12 frames? Then the shock, but also no surprise. After a manual review of the core processing code, I asked Augment Code to conduct a scientific review (keyword - USE “scientific” in your prompts when you need the AI to sort itself out or it's caught in a death loop of debugging; works for me) and…shocker…the reply from the AI was:
“I made it up”. It didn’t explain why it made it up, it just made it up. With a trailing exclamation mark, as if admitting a childish error. It also said it twice, exclamation mark and all.
Now when it comes to moments like this with AI, serious developers can throw their toys ROYALLY out of the pram. I mean, full explosion. And they are mostly justified, as they have customer/boss pressure to get things over the line and will be angered that the AI junior developer (non-human, just a computer) properly tripped up here. It did the ‘h’ word. It HALLUCINATED! HOW DARE IT!
Staring at the screen, of course I’m annoyed that the AI I pay for placed a random limit that in turn reduced the quality of the output. It is clear the hardware detector IS NOT working properly; it’s not doing the calculations as stated, which are critically needed to work out the maximum allowable number of frames to extract given the VRAM limitations. But why lose it over 12 frames? Why not explore with the AI WHY it got there, and what it got wrong? From an AI exchange of 2 minutes or so, a new joint plan was formed: harden the hardware detector and strengthen the VRAM calculations covering LLM load plus frame loading. A few tests later, things are now looking a TON better (in the text output describing what the video was about). GOOD. I’ve wasted 10 minutes fixing this. No biggy.
Could it be better? Oh yes, but it’s no longer entirely about code; it’s the limitations (currently 150 frames max for a 3-minutes-ish video for a total fit) of my 32GB of VRAM. gorge_mini will happily attempt to run on 1TB of VRAM (oh, the dream), so I’m now hardware bound: my GPU is what it is and the next jump up in GPU purchase for my company is 5 figures.
Experience then kicks in again as I feel it's not over yet… I started to think: what if we could drop the resolution of the extracted video frames? Full-fat frames extracted from a compressed video can often be bigger than the entire video itself - because they're no longer compressed - so how about extracting them at a lower resolution, and having three modes of processing: Efficient (lower quality/resolution, but doable by the LLM), Standard (slightly lower quality) and High Quality (the full-fat resolution)? So I asked the AI to push the boundaries based on our current codebase and solution(s), get creative, and explore this hardware deficit issue.
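In sketch form, the idea looks something like this (the scale factors and the OpenCV-based extraction are my own assumptions for illustration, not the values gorge_mini settled on):

```python
# Illustrative sketch of the three processing modes; scale factors are assumptions.
import cv2

MODE_SCALE = {
    "efficient": 0.25,  # quarter resolution: cheapest to hold in VRAM
    "standard":  0.5,   # half resolution
    "high":      1.0,   # full-fat frames at the original resolution
}

def extract_frames(video_path: str, every_n: int, mode: str) -> list:
    """Grab every Nth frame from the video, downscaled according to the mode."""
    scale = MODE_SCALE[mode]
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            if scale < 1.0:
                frame = cv2.resize(frame, None, fx=scale, fy=scale)
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```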
And after some back and forth (say 20 minutes) we got there.
So how does gorge_mini pull off this processing when we’ve clearly run out of VRAM when dealing with much bigger video files (and that doesn’t take much)? Batches of frames loaded in sequence. Instead of trying to push all of the extracted frames onto the GPU VRAM (what’s left after the LLM is sat up there, remember), it batches up the frames to the maximum that will fit, processes them on the GPU, releases the VRAM after ejecting the batch and moves to the next batch. Rinse, repeat. Voila! We now have a different type of output text report, one that talks about phases of analysis rather than one flowing description (which you get when you process everything in one push on the GPU) - but it works. And it’s better sounding and more descriptive.
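Stripped to its bones, the loop is something like the sketch below; describe_batch is a hypothetical stand-in for whatever vision-LLM call does the narrating, and the VRAM release step is the important bit:

```python
# Sketch of the batch-and-release loop; describe_batch is a hypothetical stand-in
# for the real vision-LLM call, defined here only to make the example runnable.
import torch

def describe_batch(batch) -> str:
    """Placeholder: narrate one batch of frames with the resident vision LLM."""
    return f"(description of a batch of {len(batch)} frames)"

def describe_video(frames: list, batch_size: int) -> list[str]:
    phase_reports = []
    for start in range(0, len(frames), batch_size):
        batch = frames[start:start + batch_size]
        # Only this batch sits on the GPU alongside the resident LLM
        phase_reports.append(describe_batch(batch))
        # Eject the batch and hand the VRAM back before loading the next one
        del batch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return phase_reports  # "scene one, scene two..." style phase descriptions
```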
There are still things yet to try and then implement. I want to look at batches again and boost the number of frames per batch, or grow the batch count dynamically to ensure that the Efficient, Standard and High Quality modes truly eke out the processing and VRAM available. I’m also shortly testing adding GPT-OSS:20b to the mix - again offline, truly private, no token costs - to take the bigger batch jobs with their “Scene one, scene two” descriptions and summarise them all together into a forensic-style report. The first tests of this look very promising indeed.
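If that second pass works out, it could look roughly like the sketch below (this assumes the Ollama Python client with gpt-oss:20b pulled locally, and the prompt wording is mine, not the version under test):

```python
# Sketch of the second-pass summarisation idea; assumes the Ollama Python client
# and a locally pulled gpt-oss:20b model. Prompt wording is illustrative only.
import ollama

def summarise_phases(phase_reports: list[str]) -> str:
    joined = "\n\n".join(
        f"Scene {i + 1}: {text}" for i, text in enumerate(phase_reports)
    )
    response = ollama.chat(
        model="gpt-oss:20b",
        messages=[{
            "role": "user",
            "content": "Combine these scene descriptions into a single "
                       "forensic-style report of the whole video:\n\n" + joined,
        }],
    )
    return response["message"]["content"]
```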
I would conclude from my own work with AI tools, and how fast I can produce applications like Gorge and gorge_mini amongst others, that:
- I’m only able to do this as I KNOW what the end state looks like. I can visualise the good, and the bad. From the user interface and how I would demand it feels, through to the minimal dials and switches so it’s not a scientific project just to process a video. I’ve lived great UIs and bad ones. I want mine to be as good as it can be for me but, more importantly, for the end user who isn’t me, and thus I must tell the AI exactly how I want it to be, for them. Experience gave me that.
- I have a strong idea about how the middleware and backend should work. I'm wary of using too many libraries to get this done, as I want it to be portable to any OS. I don't want to go so far with complexity that it becomes impossible to run elsewhere. So the flow from UI to middle to back and back again HAS to be slick, not full of excessive functionality, but enough to make it stable and secure.
- I don’t want processing to just live in the browser session. I want database operations to save states, to store summarisations, to hold metadata about what the video was about prior to processing and after. Again, these are things I know the user would ask to be added later, so do it now. Get ahead of the ask, even if it is your own. Experience.
- Knowing the limitations of offline LLMs compared to frontier LLMs online (expensive in contrast, but much more powerful), backed by 1000s of GPUs vice my one GPU. But in turn, knowing some tricks to get the best out of that single GPU, understanding LLM crashes and VRAM limitations, and appreciating the lack of resource in order to squeeze the absolute maximum from it. This reminds me of the old days of 8-bit computing and 64KB of RAM (which never was that once you take the OS etc. out as tax). But instead I’m starting my AI-enabled dev project on a consumer RTX 5090 with 32GB and praying X GB of frames can safely execute on the remainder. Experience (failure and success) gives me that.
- Lastly, being unafraid of diving into the code. In the case above, it wasn’t AI that first spotted the 12-frames ball drop. I saw it whilst scrolling through the AI-generated main functions looking for clues, and then asked Augment Code to do the scientific review knowing what I believed to be the issue - 12. I do this by default with AI creations. It’s not that I want to exert some form of human control over the process; rather, it is to critique and optimise something I WANT to work that has saved me so much time. “Why did you create function X when function Y has 80% of it already?” - experience again gives the confidence to do that. To accept (very minor) AI slop but then challenge it, and not be offended by it because it wasn’t purely crafted by my own hands, is the goal here.
The bottom line is that 2025's LLMs hand one MAJOR gift to hacky developers in their 50s like myself, now using those experiences and these new tools to create applications that would otherwise have died in my memories, or been cancelled at a customer meeting once we totted up the commitment, often due to developer costs (hiring more in) or the disruption to more pressing software requirements (read: keeping the current codebase fires burning with fixes). Without a willing, experienced and excited coding team (and the funding to employ them) immediately deploying on a new innovation, almost all of these ideas would never have transpired for me, and the same goes for my clients - until now, with AI part of the solution. A BIG PART.
A single, willing developer (I'm talking school leaver, Uni grad or Tier 1 dev) plus AI can prototype in minutes, hours, or a couple of weeks (based on their dev experience and status). I've seen them do it, and I've coached them on how to embrace these tools to enhance delivery timescales and bolster their already-sound developer talents. It really does work, and the partnership between a developer and AI is something incredible once they fully understand the limitations and the extraordinary benefits (once tamed) of the 2025 AI toolbag.
Experience also tells you that the sky won’t fall in when 12 is the AI-generated fault. It gently reminds you that humans get stuff wrong too, sometimes catastrophically, as software history tells us. Things DO get better with time, and that includes AI models that are great at coding - but so does the human developer wrangling this technology into an optimal code expeditor. That patience with new technologies, and finding ways to make them work better, IS the key, the secret sauce (and has been since the release of Claude Sonnet, ChatGPT or any other LLM I've tested). If I have to create a "crutch markdown file" to enforce strict guidance on an AI, so be it if the results are stellar. I use AI to create the markdown, then eyeball it to mark the homework and manually refine if needed, but we are talking seconds here, minutes at best, to create this file and fine-tune the outcome.
Time is the most important thing in life, period, and if the AI co-coder I look to for expediting my ideas into software reality drops the ball, experience will often spot or catch the error before it causes a drama at production time. Meanwhile, the days, weeks and months (and costs) saved to reach an MVP/working prototype - maybe not prod just yet, but not far off - would have been worth the odd "12" along the way.
Note:
I am not affiliated in any way with Augment Code, but I have had contact with them on issues related to early releases, to pass on my appreciation to the team for such a great product and to help improve the platform, with no financial benefit forthcoming, then or now. It just happens that Augment Code impresses me the most as an AI coder buddy for a one-man band. If in time something ends up beating it, I would switch in a heartbeat, but so far, that's not the case.
I also encourage users of platforms like Lovable, Replit and so on to dive into the codebase generated by AI and ask the AI questions if they do not know what they are reading on screen. These platforms are as much tutors of software development as they are creators, but they are rarely used for that additional ability. Start to develop actual coding skills vice allowing AI to do all the work; it will significantly advance your experience of using AI to create new technologies. Know how to lift the car bonnet and learn the engine basics, bottom line. You don't have to know how to redo the piston rings, but you should at least know how the thing works in order to keep it running :)
(If this interests you: the Gorge UI and gorge_mini have taken a total of 10 hours of my time, mostly in the evenings. I'm sure an experienced developer + AI would have done it in less, but I'm cool with that.)