Cutting Computer Use Costs by 80%: VibPage's Hybrid Mode Optimization

It Got Expensive Fast

After pivoting VibPage to web automation, I’d been using OpenAI’s Computer Use to control the browser. The concept is straightforward: take a screenshot → AI looks at it → decides where to click → take another screenshot → AI looks again… rinse and repeat until the task is done.

It works great. The AI can “see” everything on the screen and interact with any element, just like a human would.

But there’s a fatal flaw: it’s way too expensive.

I had it play a round of Flash Linez, a simple browser game. Dozens of screenshots back and forth, each one a high-res image sent to the AI for analysis. The result? $1.40 spent, and it didn’t even play well.

For my own testing, that’s fine. But for regular users? Completely unrealistic. Even a typical task runs about $0.50, and who’s going to pay that on a daily basis?

First Cut: Lower Resolution + Switch to JPG

Since image tokens are the biggest cost driver, that’s where I started.

Two simple changes:

  1. Lower the screenshot resolution — no need to send 4K images every time; the AI can work with less detail
  2. Switch from PNG to JPG — PNG is lossless and large; JPG is lossy but much smaller, and the AI can still understand it just fine
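
Here is a minimal sketch of those two changes, not VibPage's actual code. The function names, the 1280-pixel cap, and the JPEG quality of 70 are all assumptions chosen for illustration, and the re-encoding step assumes the Pillow library is installed.

```python
import io


def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Return dimensions scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale a small screenshot
    new_w = max(1, round(width * max_side / longest))
    new_h = max(1, round(height * max_side / longest))
    return new_w, new_h


def shrink_screenshot(png_bytes: bytes, max_side: int = 1280, quality: int = 70) -> bytes:
    """Downscale a PNG screenshot and re-encode it as JPEG (requires Pillow)."""
    from PIL import Image  # imported lazily so fit_within works without Pillow

    img = Image.open(io.BytesIO(png_bytes)).convert("RGB")  # JPEG has no alpha channel
    img = img.resize(fit_within(img.width, img.height, max_side), Image.LANCZOS)
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=quality)
    return out.getvalue()
```

With these assumed settings, a 3840x2160 capture would be sent as 1280x720. The right cap and quality depend on how small the page's text and controls are, so they are worth tuning per site.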

These two changes alone cut token consumption by about 60%.

Immediate results. But I knew there was more room to optimize.

Second Cut: DOM First, Computer Use as Fallback

Shrinking each image saves money per call, but the bigger win is making fewer image calls in the first place.

I started asking myself: does every operation really need a screenshot for the AI to look at?

The truth is, many web operations don’t require “seeing” at all. Clicking a button, filling in an input field, selecting from a dropdown: these can all be precisely located and executed through the DOM (the element tree behind a web page). A serialized DOM is just text, and it consumes roughly an order of magnitude fewer tokens than a screenshot.
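
To make “DOM is just text” concrete, here is a rough sketch of how a page could be boiled down to a few lines for the model. It is an illustration built on Python's standard `html.parser`, not VibPage's actual serializer; the tag and attribute lists and the `summarize_dom` name are assumptions.

```python
from html.parser import HTMLParser

# Assumed lists: which tags count as interactive and which attributes to keep.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
KEPT_ATTRS = ("id", "name", "type", "placeholder", "href")


class InteractiveElementLister(HTMLParser):
    """Collect one compact text line per interactive element."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_map = dict(attrs)
        # Keep only attributes that help locate the element; skip empty ones.
        kept = [f'{k}="{attr_map[k]}"' for k in KEPT_ATTRS if attr_map.get(k)]
        self.lines.append(" ".join([f"[{len(self.lines)}] <{tag}", *kept]) + ">")


def summarize_dom(html: str) -> str:
    """Return a short, numbered listing of the page's interactive elements."""
    parser = InteractiveElementLister()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


page = '<form><input id="email" type="email"><button id="go">Sign up</button></form>'
print(summarize_dom(page))
```

A listing like this is a few dozen characters per element, which is where the token savings over a screenshot come from.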

So I designed a Hybrid Mode:

  1. Try DOM mode first — read the page structure, attempt to complete the operation using element selectors
  2. Fall back to Computer Use when DOM can’t handle it — for complex visual interfaces, Canvas-rendered content, or scenarios requiring precise coordinate clicks
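
The two steps above can be sketched as a small dispatcher. This is a simplified illustration under assumptions, not the project's real control loop: `DomActionError`, `StepResult`, `run_step`, and the two stub executors are hypothetical names standing in for the real DOM and Computer Use paths.

```python
from dataclasses import dataclass
from typing import Callable


class DomActionError(Exception):
    """Raised by the DOM executor when it cannot complete a step."""


@dataclass
class StepResult:
    mode: str    # "dom" or "computer_use"
    detail: str  # whatever the executor reports back


def run_step(
    step: str,
    dom_executor: Callable[[str], str],
    vision_executor: Callable[[str], str],
) -> StepResult:
    """Try the cheap DOM path first; fall back to screenshots only on failure."""
    try:
        return StepResult("dom", dom_executor(step))
    except DomActionError:
        return StepResult("computer_use", vision_executor(step))


# Stub executors for demonstration only.
def fake_dom(step: str) -> str:
    if "canvas" in step:
        raise DomActionError("no selectable element")
    return f"dom handled: {step}"


def fake_vision(step: str) -> str:
    return f"vision handled: {step}"


print(run_step("click #submit", fake_dom, fake_vision).mode)
print(run_step("click canvas tile", fake_dom, fake_vision).mode)
```

Keeping the fallback per step rather than per task means a single Canvas widget on a page does not force every other click on that page through screenshots.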

In practice, 70-80% of web operations can be handled by DOM alone. Only a small fraction needs to fall back to Computer Use.

On the operations DOM can handle, this second cut reduced token consumption by roughly 90%.

Let’s Do the Math

Stacking both optimizations, the results speak for themselves:

| Mode | Cost per typical task |
| --- | --- |
| Pure Computer Use (before) | ~$0.50 |
| Lower resolution + JPG | ~$0.20 |
| Hybrid Mode (DOM first) | ~$0.10 |

From $0.50 down to $0.10 — roughly 80% savings overall.
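
For anyone who wants to check the arithmetic, here is a short sketch that recomputes the savings from the per-task costs listed above. The `savings_percent` helper is just an illustrative name.

```python
# Approximate per-task costs in USD, taken from the figures above.
costs = {
    "Pure Computer Use (before)": 0.50,
    "Lower resolution + JPG": 0.20,
    "Hybrid Mode (DOM first)": 0.10,
}


def savings_percent(before: float, after: float) -> int:
    """Percentage saved when the cost drops from `before` to `after`."""
    return round(100 * (1 - after / before))


baseline = costs["Pure Computer Use (before)"]
for mode, cost in costs.items():
    print(f"{mode}: ${cost:.2f} ({savings_percent(baseline, cost)}% below baseline)")
```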

$0.10 per automated task is much more reasonable. For batch execution of simple scheduled tasks, costs can be even lower.

Why Not Just Use DOM from the Start?

You might wonder: if DOM is so great, why not go all-DOM from the beginning?

Because DOM has its limitations:

  • Some pages have extremely complex structures — DOM trees nested dozens of levels deep make it hard for AI to find the right elements
  • Some content is drawn dynamically outside the DOM (Canvas, for example), so what’s visible on screen has no matching element to select
  • Some operations require visual judgment — like “click that red button in the middle of the page,” which is hard to locate through DOM alone
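
These limitations can also be checked up front, before even attempting the DOM path. The sketch below is a hypothetical routing heuristic, not something taken from VibPage: the `PageSignals` fields, the `choose_mode` name, and the depth threshold of 40 are all assumptions for illustration.

```python
from dataclasses import dataclass

MAX_DOM_DEPTH = 40  # hypothetical cutoff for "too deeply nested to navigate reliably"


@dataclass
class PageSignals:
    max_dom_depth: int           # deepest nesting level found in the page
    has_canvas: bool             # content drawn outside the DOM
    needs_visual_judgment: bool  # e.g. "click that red button in the middle"


def choose_mode(signals: PageSignals) -> str:
    """Pick "dom" unless one of the known DOM limitations applies."""
    if signals.has_canvas or signals.needs_visual_judgment:
        return "computer_use"
    if signals.max_dom_depth > MAX_DOM_DEPTH:
        return "computer_use"
    return "dom"


print(choose_mode(PageSignals(max_dom_depth=12, has_canvas=False, needs_visual_judgment=False)))
print(choose_mode(PageSignals(max_dom_depth=12, has_canvas=True, needs_visual_judgment=False)))
```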

Computer Use’s advantage is WYSIWYG — the AI sees what you see and operates accordingly, just like a human would. You can’t lose that fallback capability.

So the best approach isn’t either-or, but hybrid: use DOM when you can, switch to Computer Use when you must.

Final Thoughts

Optimization is a fascinating process.

I started with a simple thought — “this is too expensive, I need to cut costs” — and ended up shaving off 80% through incremental improvements. The approach wasn’t even complicated: first reduce data size, then reduce the number of calls.

It reminds me of a broader truth: the first version of any technical solution is usually “just get it working.” The real optimization opportunities only become visible after you’ve got it running.

VibPage now runs in hybrid mode with nearly the same effectiveness as pure Computer Use, but at one-fifth the cost. That means more people can actually afford AI web automation — and that’s exactly what I want.

The project remains fully open source: