frabonacci's comments | Hacker News

Thanks - trajectory export was key for us since most teams want both eval and training data.

On non-determinism: we actually handle this in two ways. For our simulated environments (HTML/JS apps like the Slack/CRM clones), we control the full render state so there's no variance from animations or loading states. For native OS environments, we use explicit state verification before scoring - the reward function waits for expected elements rather than racing against UI timing. Still not perfect, but it filters out most flaky failures.
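
For the curious, here's a rough sketch of that "verify state before scoring" pattern - the names (wait_for_element, env.ui_tree, task.expected_element) are illustrative stand-ins, not our actual API:

    import time

    def wait_for_element(get_state, predicate, timeout=10.0, interval=0.5):
        # Poll the environment's UI state until the predicate matches or we time out,
        # instead of scoring immediately and racing against animations/loading.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if predicate(get_state()):
                return True
            time.sleep(interval)
        return False

    def reward(env, task):
        # Only score once the expected element has actually appeared.
        if not wait_for_element(env.ui_tree, task.expected_element):
            return 0.0  # timeout counts as failure, not flakiness
        return 1.0 if task.verify(env.ui_tree()) else 0.0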

Windows Arena specifically - we're focusing on common productivity flows (file management, browser tasks, Office workflows) rather than the edge cases you mentioned. UAC prompts and driver dialogs are exactly the hard mode scenarios that break most agents today. We're not claiming to solve those yet, but that's part of why we're open-sourcing this - want to build out more adversarial tasks with the community.


Fair point - we only just open-sourced this, so benchmark results are coming. We're already working with labs on evals, focusing on tasks that are more realistic than OSWorld/Windows Agent Arena and curated with actual workers. If you want to run your agent on it, we'd love to include your results.

Hey visarga - I'm the founder of Cua; we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?

The author's differential testing (2.3M random battles) is great as final validation, but the real lesson here is that modular testing should happen during the port, not after.

1. Port tests first - they become your contract
2. Run unit tests per module before moving on - catches issues like the "two different move structures" early
3. Integration tests at boundaries before proceeding
4. E2E/differential testing as final validation

When you can't read the target language, your test suite is your only reliable feedback. Much of the time spent debugging integration issues would have been saved by catching those bugs earlier with progressive testing.
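
To make the final-validation step concrete, differential testing boils down to something like the sketch below - reference_engine and ported_engine are hypothetical stand-ins for the original and ported battle engines:

    import random

    def differential_test(reference_engine, ported_engine, n_battles=1000):
        # Drive both implementations with the same seed and compare outcomes;
        # any mismatch points at a behavioural difference introduced by the port.
        mismatches = []
        for _ in range(n_battles):
            seed = random.getrandbits(64)
            expected = reference_engine.run_battle(seed)
            actual = ported_engine.run_battle(seed)
            if expected != actual:
                mismatches.append((seed, expected, actual))
        return mismatches

The earlier unit and integration layers catch the cheap bugs, so this last pass only has to surface the genuinely subtle divergences.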


The real lesson... I mean, if all of this took 1 month, the TFA already did amazingly well. Next time they'll do even better, no doubt.


Thanks! On API call visibility - Lume's MCP interface doesn't expose outbound network traffic directly. It's focused on VM lifecycle (create, run, stop) and command execution, not network inspection.

For agent observability, we handle this at the Cua framework level rather than the VM level:

- Agent actions and tool calls are logged via our tracing integration (Laminar, OpenTelemetry)
- You can see the full decision trace - what the agent saw, what it decided, what tools it invoked
- For the "what HTTP requests actually went out" question, proxying is still the right approach. You could configure the VM's network to route through a transparent proxy, or set up mitmproxy inside the VM (rough sketch below). We haven't built that into Lume itself since network inspection feels orthogonal to VM management.
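
If anyone wants to try the mitmproxy route today, a minimal addon that logs every outbound request looks roughly like this (assuming mitmproxy sits on the VM's network path as a transparent proxy):

    # log_requests.py - run with: mitmdump --mode transparent -s log_requests.py
    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        # Print every HTTP(S) request the agent's VM sends out
        print(f"{flow.request.method} {flow.request.pretty_url}")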

That said, it's an interesting idea - exposing a proxy config option in Lume that automatically routes VM traffic through a capture layer. Would that be useful for your workflow?


MDM platforms can skip Setup Assistant, but they require the device to be pre-enrolled in Apple Business Manager before first boot - VMs can't be enrolled in ABM, so those hooks aren't available.

defaults write only works after you have shell access, which means Setup Assistant is already done.

There are tools that modify marker files like .AppleSetupDone via Recovery Mode, but that's mainly for bypassing MDM enrollment on physical Macs - you'd still need to create a valid user account with proper Directory Services entries, keychain, etc.

The VNC + OCR approach is less elegant but works reliably without needing to reverse-engineer macOS internals or rely on undocumented behaviors that might break between versions.
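
As a rough illustration of the general technique (not Lume's actual implementation, which is written in Swift), the VNC + OCR loop amounts to something like this with vncdotool and pytesseract:

    import pytesseract
    from PIL import Image
    from vncdotool import api

    client = api.connect("127.0.0.1::5900")  # the VM's VNC endpoint (example address)
    client.captureScreen("screen.png")

    # OCR the screenshot and click the center of the "Continue" button if found
    data = pytesseract.image_to_data(Image.open("screen.png"),
                                     output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word.strip() == "Continue":
            x = data["left"][i] + data["width"][i] // 2
            y = data["top"][i] + data["height"][i] // 2
            client.mouseMove(x, y)
            client.mousePress(1)  # left click
            break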


Surely your VNC script is guaranteed to break between versions


Thanks! On graphics - currently it's paravirtualized via Apple's Virtualization Framework, so basic 2D acceleration but no GPU passthrough. Fine for desktop use, web browsing, coding, productivity apps. Wouldn't recommend it for anything GPU-intensive though.

Good news is there are hints of GPU passthrough coming (_VZPCIDeviceConfiguration symbol appeared in Tahoe's Virtualization framework), so that might land in a future macOS release. We're keeping an eye on it.


Nice, thanks for sharing! It'd be interesting to integrate MIST into lume's ipsw command - right now Apple's native support in the Virtualization framework only provides download links for the latest version supported by the host, so grabbing older versions requires workarounds like this.


Both use Apple's Virtualization Framework, so core VM performance is similar. Main differences are around agent-first design (HTTP API, MCP server), unattended setup via VNC + OCR, and registry support for VM images.

We've also built a broader ecosystem on top - the Cua computer and agent framework for building computer-use agents: https://cua.ai/docs

We went through the comparison with Tart, Lima etc here: https://github.com/trycua/cua/issues/10


Thanks for answering, makes sense.

Not seeing any reference to Tart at that link. Tart also has registry support for VM images - it treats them very much like Docker images. Is that what you are doing too?

Is it worth putting a comparison up somewhere other than a Github thread? Seems to be a frequently asked question at this point.

Also worth drawing attention to Tart being source-available, not open source.


Thanks for the feedback! You're right that a proper comparison page beats hunting through GitHub issues.

We just put one together (with some help from Claude Code, naturally): https://cua.ai/docs/lume/guide/getting-started/comparison


Thanks, much appreciated. The "Registry Support" section is weird though - isn't GHCR an instance of an OCI registry? The "when to choose Lume" part of the Tart section should also mention licensing; it's relevant at the point of choosing.


Good catches, thanks! Just updated the page:

Fixed the registry description - you're right, GHCR is an OCI registry. Both tools use OCI-compatible registries; we just default to GHCR/GCS.

Added licensing to the "when to choose" sections.


Good changes - like the new theme too. I'd still match the two boxes if it were me (both should read "OCI registry" and optionally include GHCR, but they should be identical).


> Lume automates the macOS Setup Assistant via VNC and OCR, creating ready-to-use VMs without manual clicking. Tart relies on Packer plugins for automation.

This feels disingenuous. Tart has unattended setup support as well, and it's based on the same VNC + OCR technique as Lume. In fact Tart had it first, and your approach seems to be heavily inspired by it. In addition the boot command instructions you're using came from https://github.com/cirruslabs/macos-image-templates/

The only material difference is whether it's built-in or integrated via Packer.


Fair point - both use VNC for unattended setup. The difference is implementation: Tart does it via a Packer plugin (Go), while we built it natively in Swift with a customizable YAML schema that's less error-prone. The user-facing difference is an --unattended flag vs a Packer workflow.


Yeah, Apple intentionally provides no unattended setup. Plus any process trying to control the UI programmatically needs explicit accessibility permissions, which defeats the purpose.

So we just click through like a human would via VNC. Version-specific but works with their security model rather than against it.


That's a great approach.

