p1necone's comments | Hacker News

Every so often I try out a GPT model for coding again, and manage to get tricked by the very sparse conversation style into thinking it's great for a couple of days (when it says nothing, then finishes producing code with an 'I did x, y and z' and none of the stupid 'you're absolutely right' sucking up, and it works, it feels very good).

But I always realize it's just smoke and mirrors - the actual quality of the code, the failure modes and so on are just so much worse than with Claude and Gemini.


I am a novice programmer -- I have programmed for 35+ years now, but I build and lose the skills moving from coder to manager to sales -- multiple times. Fresh IC again since last week :) I started coding with Fortran, RPG and COBOL, and I have also coded Java and Scala. I know modern architecture but haven't done enough grunt work to make it work or to debug (and fix) a complex problem. Needless to say, sometimes my eyes glaze over the code.

And I write some code for my personal enjoyment. 6-8 months back I gave it to Claude for improvement; it gave me a massive change log that was quite risky, so I abandoned it.

I tried this again with Gemini last week. I was more prepared and asked it to improve class by class, and for whatever reasons I got better answers -- changed code, with explanations -- and when I asked it to split the refactor into smaller steps, it did so. It was a joy working on this over the Thanksgiving holidays. It could break the changes into small pieces, talk through them as I evolved concepts learned previously, took my feedback and prioritization, and also gave me a nuanced explanation of the business objectives I was trying to achieve.

This is not to downplay Claude; that is just the narration of the sequence of events. So while it may or may not work well for experienced programmers, it is such a helpful tool for people who know the domain or the concepts (or both) and struggle with details, since the tool can iron out a lot of details for you.

My goal now is to have another project for the winter holidays and then think through 4-6 hour AI-assisted refactors over the weekends. Do note that this is a project of personal interest, so I'm not spending weekends working for the big man.


> I was more prepared and asked it to improve class by class, and for whatever reasons I got better answers

There is a learning curve with all of the LLM tools. It's basically required for everyone to go through the trough of disillusionment when you realize that the vibecoding magic isn't quite real in the way the influencers talk about it.

You still have to be involved in the process, steer it in the right direction, and review the output. Rejecting a lot of output and re-prompting is normal. From reading comments I think it's common for new users to expect perfection and reject the tools when they aren't vibecoding the app for them autonomously. To be fair, that's what the hype influencers promised, but it's not real.

If you use it as an extension of yourself that can type and search faster, while also acknowledging that mistakes are common and you need to be on top of it, there is some interesting value for some tasks.


For me the learning curve was learning to choose what is worth asking Claude. After 3 months on it, I can reap the benefit: Claude gets the code I want right 80% of the time. I usually ask it to: create new functions from scratch (it truly shines at understanding the context of these functions by reusing other parts of the code I wrote), refactor code, and create little tools (for example a chart viewer).

It really depends on what you're building. As an experiment, I started having Claude Code build a real-time strategy game a bit over a week ago, and it's done an amazing job, with me writing no code whatsoever. It's an area with lots of tutorials for code structure etc., and I'm guessing that helps. And so while I've had to read the code and tell it to refactor things, it has managed to do a good job of it with just relatively high-level prodding, and produced a well-architected engine with traits-based agents for the NPCs and a lot of well-functioning game mechanics. It started as an experiment, but now I'm seriously toying with building an actual (but small) game with it just to see how far it can get.

In other areas, it is as you say and you need to be on top of it constantly.

You're absolutely right re: the learning curve, and you're much more likely to hit an area where you need to be on top of it than one it can do autonomously, at least without a lot of scaffolding in the form of sub-agents, rules to follow, agent loops with reviews, etc., which takes a lot of time to build up and often includes a lot of things specific to what you want to achieve. Figuring out how much of that effort is worth it for a given project will also take time.


I suspect the meta-architecture can also be done autonomously, though no one has got there yet; figuring out the right fractal dimension for sub-agents and the right prompt context can itself be thought of as a learning problem.

I appreciate this narrative; it's relatable to how I have experienced, and watched others around me experience, the last few years. It's as if we're all kinda-sorta following a similar "Dunning–Kruger effect" curve at the same time. It feels similar to growing up mucking around with a PPP connection and Netscape in some regards. I'll stretch it: "multimodal", meet your distant analog "hypermedia".

My problem with Gemini is how token hungry it is. It does a good job but it ends up being more expensive than any other model because it's so yappy. It sits there and argues with itself and outputs the whole movie.

Breaking down requirements, functionality and changes into smaller chunks is going to give you better results with most of the tools. If it can complete smaller tasks within the context window, the quality will likely hold up. My go-to has been to develop task documents with multiple pieces of functionality and subtasks. Build one piece of functionality at a time. Commit, clear context and start the next piece of functionality. If something goes off the rails, back up to the commit, fix and rebase the future changes, or abandon and branch.

That’s if I want quality. If I just want to prototype and don’t care, I’ll let it go. See what I like, don’t like and start over as detailed above.


Interesting. In my experience, Claude is somehow much better at stuff involving frontend design compared to other models (GPT is pretty bad). Gemini is also good, but the "thinking" mode often just adds stuff to my code that I did not ask it to add or modifies stuff to make it "better". It likes to one-up the objective a lot, which is not great when you're just looking for it to do precisely what you asked and nothing else.

I have never considered trying to apply Claude/Gemini/etc. to Fortran or COBOL. That would be interesting.

You can actually use Claude Code (and presumably the other tools) on non-code projects, too. If you launch claude code in a directory of files you want to work on, like CSVs or other data, you can ask it to do planning and analysis tasks, editing, and other things. It's fun to experiment with, though for obvious reasons I prefer to operate on a copy of the data I'm using rather than let Claude Code go wild.

I use Claude Code for "everything", and just commit most things into git as a fallback.

It's great to then just have it write scripts, and then write skills to use those scripts.

A lot of my report writing etc. now involves setting up a git repo and using Claude to do things like process the transcripts from discovery calls and turn them into initial outlines, questions that need follow-up, and task lists, and write scripts to do the necessary analysis, so I can focus on the higher-level stuff.


Side note from someone who just used Claude Code today for the first time: Claude Code is a TUI, so you can run it in any folder/with any IDE and it plays along nicely. I thought it was just another VS Code clone, so I was pleasantly surprised that it didn't try to take over my entire workflow.

It's even better: It's a TUI if you launch it without options, but you can embed it in scripts too - the "-p" option takes a prompt, in which case it will return the answer, and you can also provide a conversation ID to continue a conversation, and give it options to return the response as JSON, or stream it.
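
For anyone curious what that looks like in a script, here's a rough sketch of driving it headlessly from Python. The -p and --output-format flags are the ones described above, but I'm going from memory, so double-check `claude --help` on your version before relying on them:

    # Rough sketch of scripting Claude Code non-interactively, per the comment above.
    # Flag names are my best recollection; verify them with `claude --help`.
    import json
    import subprocess

    def ask_claude(prompt: str) -> dict:
        # -p runs a single non-interactive prompt; --output-format json asks for structured output
        result = subprocess.run(
            ["claude", "-p", prompt, "--output-format", "json"],
            capture_output=True,
            text=True,
            check=True,
        )
        return json.loads(result.stdout)

    if __name__ == "__main__":
        print(ask_claude("List the TODO comments in this repo"))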

Many of the command line agent tools support similar options.


They also have a VS Code extension that compares with GitHub Copilot now, just so you know.

I was just giving my history :) but yes, I am sure this could actually get us out of the COBOL lock-in that requires 70-year-old programmers to continue working.

The last article I could find on this is from 2020 though: https://www.cnbc.com/2020/04/06/new-jersey-seeks-cobol-progr...


Or you could just learn COBOL. Using an LLM with a language you don’t know is pretty risky. How do you spot the subtle but fatal mistakes they make?

I'm starting with Claude at work but did have an okay experience with OpenAI so far. For clearly delimited tasks it does produce working code more often than not. I've seen some improvement on their side compared to, say, last year. For something more complex and not clearly defined in advance, yes, it does produce plausible garbage and goes off the rails a lot. I was migrating a project and asked ChatGPT to analyze the original code base and produce a migration plan. The result seemed good and encouraging, because I didn't know much about that project at the time. But I ended up taking a different route, and when I finished the migration (with bits of help from ChatGPT) I looked at the original migration plan out of curiosity, since I had become more familiar with the project by then. And the migration plan was an absolutely useless and senseless hallucination.

Use Codex for coding work

On the contrary, I cannot use the top Gemini and Claude models because their outputs are so out of place and hard to integrate with my code bases. The GPT-5 models integrate with my code base's existing patterns seamlessly.

Supply some relevant files from your codebase in the claude.ai project area in the right part of the browser. Usually it will understand your architecture, patterns, and principles.

I'm using AI in-editor, all the models have full access to my code base.

You realize, on some level, that all of these sorts of anecdotes are simply random coincidence, though.

NME at all - 5.1 Codex has been the best by far.

How can you stand the excruciating slowness? Claude Code runs circles around Codex. The most mundane tasks make it think for a minute before doing anything.

I use it on medium reasoning and it's decently quick. I only switch to gpt-5.1-codex-max xhigh for the most annoying problems.

By learning to parallelize my work. This also solved my problem with slow Xcode builds.

Well, you can’t edit files while Xcode is building or the compiler will throw up, so I'm wondering what you mean here. You can’t even run swift test in 2 agents at the same time, because Swift serializes access for some reason.

Whenever I have one agent running Swift tests in a loop to fix things and another one building something, the latter will disturb the former and I need to cancel.

And then there’s a lot of work that can’t be parallelized, like complex git rebases - well, you can do other things in a worktree, but good luck merging that after you've changed everything in the repo. Codex is really, really bad at git.


Yes, these are horrible pain points. I can only hope Apple improves this stuff if it's true that they're adding MCP support throughout the OS, which should require better multi-agent handling.

You can use worktrees to have multiple copies building or testing at once

I'm a solo dev so I rarely use some git features like rebase. I work out of trunk only without branches (if I need a branch, I use a feature flag). So I can't help with that

What I did was build an Xcode MCP server that controls Xcode via AppleScript and the simulator via accessibility & idb. For running, it gives locks to the agent that the agent releases once it's done via another command (or, for more typical use, by pattern matching on log output or scripting end-of-lock criteria in JS, so the lock ends "atomically" without requiring a follow-up command). For testing, it serializes the requests into a queue and blocks the MCP response.
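
To make the serialization idea concrete, here is a minimal, hypothetical sketch of what the locking could look like; the names and structure are made up for illustration and are not the actual tool:

    # Hypothetical sketch of serializing simulator/Xcode access across parallel agents:
    # a single lock so concurrent tool calls queue up instead of clobbering each other.
    import threading
    from contextlib import contextmanager

    _xcode_lock = threading.Lock()

    @contextmanager
    def exclusive_xcode_access():
        # Block until the previous agent's build/test run has released the lock.
        _xcode_lock.acquire()
        try:
            yield
        finally:
            _xcode_lock.release()

    def run_tests_tool(scheme: str) -> str:
        # Illustrative tool handler: only one agent touches xcodebuild at a time.
        with exclusive_xcode_access():
            # ... invoke xcodebuild / AppleScript here ...
            return f"tests finished for {scheme}"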

This works well for me because I care more about autonomous parallelization than about eliminating waiting states, as long as I myself am never waiting. (This is all very interesting to me as a former DevOps/Continuous Deployment specialist - dramatically different practices around optimizing delivery these days...)

Once I get this tool working better I will productize it. It runs fully inside the macOS sandbox, so I will deploy it to the Mac App Store and have an iOS companion for monitoring & managing it that syncs via iCloud and Tailscale (no server on my end, more privacy friendly). If this sounds useful to you, please let me know!

In addition to this, I also just work on ~3 projects at the same time and rotate through them by keeping about 20 iTerm2 tabs open, where I use the title of each tab (cmd-i to update) as the task title for my own sake.

I've also started building more with SwiftWASM (with SQLite WASM, and I am working on porting SQLiteData to WASM too so I can have a unified data layer that has iCloud sync on Apple platforms) and web deployment for some of my apps' features, so that I can iterate more quickly and reuse the work in the apps.


Yes, that makes sense to me. I cannot really put builds in a queue because I give my agents very fine-grained updates, so they do need direct feedback to check that what they have just done actually works, or they will interfere with each other’s work.

I do strive to use macOS targets because those are easier to deal with than a simulator, especially when you use Bluetooth stuff, and you get direct access to log files and SQLite files.

Solo devs have it way easier in this new world because there are no strict rules to follow. Whatever goes, goes, I guess.


I found Codex got much better (with some AGENTS.md context about it) at ignoring unrelated changes from other agents in the same repo. But making worktrees easier to spin up and integrate back might be a better approach for you.

When the build fails (rather than a functional failure), most of the time I like to hand the failure to a brand new agent to fix rather than waste context on the original agent resolving it, now that they're good at picking up on those changes. It wastes less precious context on the main task, and makes it easier not to worry about which agent addresses which build failure.

And then for individual agents checking their own work, I rely on them inspecting test or simulator/app results. This works best if agents don't break tests outside the area they're working in. I try to avoid having parallel agents working on similar things in the same tree.

I agree on the Mac target ease. Especially also if you have web views.

Orgs need to adapt to this new world too. The old way of generally forcing devs to work on only one task at a time to completion doesn't make as much sense anymore, even from the perspective of the strictest of lean principles. Figuring that out and helping educate that transformation will be my challenge if I want to productize this.


How can I get in touch?

hn () manabi.io

I use the web UI; it's easy to parallelize stuff to 90% done, then manually finish the last 10% and do a quick test.

For Xcode projects?

I workshop a detailed outline with it first, and once I'm happy with the plan/outline, I let it run while I go do something else.

In my tests (https://github.com/7mind/jopa), Gemini 3 is somewhat better than Claude with Opus 4.5. Both obliterate Codex with 5.1.

What's - roughly - your monthly spend when using pay-per-token models? I only use fixed-price Copilot, and my napkin maths says I'd be spending something crazy like $200/mo if I went pay-per-token on the more expensive models.

They have subscriptions too (at least Claude and ChatGPT/Codex; I don't use Gemini much). It's far cheaper to use the subscriptions first and then switch to paying per token beyond that.

Something around 500 euros.

Codex is super cheap though; even with the cheapest GPT subscription you get lots of tokens. I use Opus 4.5 at work and Codex at home; tbh the differences are not that big if you know what you are doing.

NME = "not my experience" I presume.

JFC TLA OD...


I've been getting great results from Codex. Can be a bit slow, but gets there. Writes good Rust, powers through integration test generation.

So (again) we are just sharing anecdata


You're absolutely right!

Somehow it doesn't get on my nerves (unlike Gemini with "Of course").


Can you give a concrete example of a programming task GPT fails to solve?

Interested, because I’ve been getting pretty good results on different tasks using Codex.


Try asking it to write some GLSL shaders. Just describe what you want to see and then try to run the shaders it outputs. It can output a UV map or a simple gradient right, but when it comes to slightly more complex shaders, most of the time the result will not compile or run properly; it will sometimes mix GLSL versions, and sometimes just straight make up things which don't work or don't output what you want.

Library/API conflicts are usually the biggest pain point for me, especially breaking changes. RLlib (currently 2.41.0) and Gymnasium (currently 0.29.0+) have ended in circles many times for me because they tend to be out of sync (for multi-agent environments). My go-to test now is a simple hello-world-type card game like War: competitive multi-agent with RLlib and Gymnasium (PettingZoo tends to cause even more issues).

Claude Sonnet 4.5 was able to figure out a way to resolve it eventually (around 7 fixes), and I let it create an rllib.md with all the fixes and pitfalls; I'm curious whether feeding this file to the next experiment will lead to a one-shot. GPT-5 struggled more, but I haven't tried Codex on this yet, so it's not exactly fair.

All done with Copilot in agent mode, just prompting, no specs or anything.


I posted this example before, but academic papers on algorithms often have pseudocode and no actual code.

I thought it would be handy to use AI to produce code from the paper, so a few months ago I tried to use Claude (not GPT, because I only have access to Claude) to recreate C++ code implementing the algorithms in this paper, as practice for me in LLM use, and it didn’t go well.

https://users.cs.duke.edu/~reif/paper/chen/graph/graph.pdf


I just tried it with GPT-5.1-Codex. The compression ratio is not amazing, so I'm not sure if it really worked, but at least it ran without errors.

A few ideas on how to make it work for you:

1. You gave a link to a PDF, but you did not describe how you provided the content of the PDF to the model. It might only have read the text with something like pdftotext, which for this PDF results in a garbled mess. It is safer to convert the pages to PNG (e.g. with pdftoppm) and let the model read the pages as images (a rough sketch of this step follows the list). A prompt like "Transcribe these pages as markdown." should be sufficient. If you cannot see what the model did, there is a chance it made things up.

2. You used C++, but Python is much easier to write. You can tell the model to translate the code to C++ once it works in Python.

3. Tell the model to write unit tests to verify that the individual components work as intended.

4. Use Agent Mode and tell the model to print something and to judge whether the output is sensible, so it can debug the code.
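
For item 1, here's a minimal sketch of the pdftoppm step (pdftoppm comes from poppler-utils; the file names and resolution are just examples):

    # Sketch of step 1: rasterize the PDF so the model reads page images
    # instead of garbled extracted text. Paths here are illustrative.
    import subprocess
    from pathlib import Path

    def pdf_to_pngs(pdf_path: str, out_dir: str = "pages") -> list[Path]:
        Path(out_dir).mkdir(exist_ok=True)
        # -r 150 gives a readable resolution; pages land in pages/page-1.png, page-2.png, ...
        subprocess.run(
            ["pdftoppm", "-png", "-r", "150", pdf_path, f"{out_dir}/page"],
            check=True,
        )
        return sorted(Path(out_dir).glob("page-*.png"))

    if __name__ == "__main__":
        print(pdf_to_pngs("graph.pdf"))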


Interesting. Thanks for the suggestions.

Completely failed for me when running the code it changed in a Docker container I keep running. Claude did it flawlessly. It absolutely rocks at code reviews, but it's terrible in comparison at generating code.

It really depends on what kind of code. I've found it incredible for frontend dev, and for scripts. It falls apart in more complex projects and monorepos

I find that for difficult math and design questions, GPT-5 tends to produce better answers than Claude and Gemini.

Could you clarify what you mean by design questions? I do agree that GPT-5 tends to have a better agentic dispatch style for math questions, but I've found it has really struggled with data model design.

Same experience here. The more commonly known the stuff it regurgitates is, the fewer errors. But if you venture into RF electronics or embedded land, beware of it turning into a master of bs.

Which makes sense for something that isn’t AI but an LLM.


At this point you are now forced to use the "AI"s as code search tools--and it annoys me to no end.

The problem is that the "AI"s can cough up code examples based upon proprietary codebases that you, as an individual, have no access to. That creates a significant quality differential between coders who only use publicly available search (Google, GitHub, etc.) vs. those who use "AI" systems.


How would the AIs have access to proprietary codebases?

Microsoft owns GitHub.

I like Goldilocks services: as big or as small as actually makes sense for your domain/resource considerations, usually with no single-HTTP-endpoint services in sight.

Once upon a time, that's what a microservice was. A monolith was the company's software all in one software package.

I think what changed things is that FaaS came along and people started describing nanoservices as microservices, which led to really dumb decisions.

I've worked on a true monolith and it wasn't fun. Having your change rolled back because another team made a mistake and it was hard to isolate the two changes was really rough.


Looks like they built a new NAS but kept using the same drives, which, given the number of drive bays in the NAS, probably make up a large majority of the overall cost of something like this.

Edit: reading comprehension fail - they bought the drives earlier, at an unspecified price, but they weren't from the old NAS. I agree: when drive lifetimes are measured in decades and huge amounts of TBW, it seems pretty silly to buy new ones every time.


MB and other elements are more concerning than the drives.

For system failure, yes, but not if data retention and recovery is your primary concern.

When building a device primarily used for storing personal things, I'd much prefer to save money on the motherboard and risk that failing than skimp on the drives themselves.


Don't skimp on the power supply either. A dodgy PSU can torch all devices attached to it.

How do I know? I've had two drives and one MB fail in quick succession thanks to a silently failing power supply.


You actually want a reliable MB & RAM to ensure data doesn't get corrupted in memory first, since you have various ways of writing data to disks that offer you resiliency.

Eh, cheap motherboards aren't harmless to the rest of the hardware. I personally don't skimp on motherboards and would much rather skimp on the drives themselves, as I have redundancy and 1-2 drives failing wouldn't hurt too much. And data retention is my top priority.

Motherboards have fried connected hardware before: poor grounding/ESD protection, firmware bugs combined with aggressive power management, wiring weirdness and power-related faults have all broken people's drives before.

What I've never heard about is a drive breaking something else in a system, but broken motherboards have taken friends with them more than once.


Not sure why you are being downvoted. The MB is a single point of failure in this system; the drives are not.

I’ve experienced many drive failures over the years but never lost data, thanks to RAID. A failing MB or PSU, on the other hand, has wiped out my entire system.


I have lost data due to errors in Intel's BIOS RAID.

This is the funniest edit I have read in a while.

Waiting for the edit to the edit

In my defense - the paragraph under the 'Storage' header reads, to me, like what I said, whereas the 'Bulk Storage Hard Disk Drives' header says something kind of contradictory to that ('collection of brand new parts' vs 'my own decommissioned hard drives').

It's absolutely possible to use an LLM to generate code, carefully review, iterate on and test it, and produce something that works and is maintainable.

The vast majority of LLM-generated code that gets submitted in PRs on public GitHub projects is not that - see the examples they gave.

Reviewing all of that code on its merits alone in order to dismiss it would take an inordinate amount of time and effort that would be much better spent improving the project. The alternative is a blanket ban on LLM-generated code, which is a lot less effort to enforce because it doesn't involve needing to read piles and piles of nonsense.


I'm in a committed long term relationship. I absolutely do not want to shit in front of my partner (nor do they have any desire to watch).

Implementing all of those things is an order of magnitude more complex than any other first-class primitive datatype in most languages, and there's no obvious "one right way" to do it that would fit everyone's use cases - it seems like libraries and standalone databases are the way to do it, and that's what we do now.

I feel like I'm going insane reading how people talk about "vulnerabilities" like this.

If you give an LLM access to sensitive data, user input, and the ability to make arbitrary HTTP calls, it should be blindingly obvious that it's insecure. I wouldn't even call this a vulnerability; this is just intentionally exposing things.

If I had to pinpoint the "real" vulnerability here, it would be this bit, but the way it's just added as a sidenote seems to be downplaying it: "Note: Gemini is not supposed to have access to .env files in this scenario (with the default setting ‘Allow Gitignore Access > Off’). However, we show that Gemini bypasses its own setting to get access and subsequently exfiltrate that data."


These aren't vulnerabilities in LLMs. They are vulnerabilities in software that we build on top of LLMs.

It's important we understand them so we can either build software that doesn't expose this kind of vulnerability or, if we build it anyway, we can make the users of that software aware of the risks so they can act accordingly.


Right; the point is that it's the software that gives "access to sensitive data, user input and the ability to make arbitrary http calls" to the LLM.

People don't think of this as a risk when they're building the software, either because they just don't think about security at all, or because they mentally model the LLM as unerringly subservient to the user — as if we'd magically solved the entire class of philosophical problems Asimov pointed out decades ago without even trying.


> Linux compatibility layer

I can't wait to play Windows PC games on a Linux compatibility layer (Proton) on a Fuchsia compatibility layer (Starnix) and still have them inexplicably run smoother than on the system they were originally developed for.


> if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student.

With coding it feels more like working with two devs - one is a competent intermediate level dev, and one is a raving lunatic with zero critical thinking skills whatsoever. Problem is you only get one at a time and they're identical twins who pretend to be each other as a prank.


I've had fun putting "always say X instead of 'You're absolutely right'" in my LLM instructions file; it seems to listen most of the time. For a while I made it 'You're absolutely goddamn right', which was slightly more palatable for some reason.

I've found that it still can't really ground me when I've played with it. Like, if I tell it to be honest (or even brutally honest) it goes wayyyyyyyyy too far in the other direction and isn't even remotely objective.

Yeah, I tried that once following some advice I saw on another HN thread, and the results were hilarious but not at all useful. It aggressively nitpicked every detail of everything I told it to do, and never made any progress. And it worded all of these nitpicks like a combination of the guy from the ackchyually meme (https://knowyourmeme.com/memes/ackchyually-actually-guy) and a badly written Sherlock Holmes.

My advice would be: it can't agree with you if you don't tell it what you think. So don't. Be careful about leading questions (Clever Hans effect) though.

So better than "I'm thinking of solving x by doing y" is "What do you think about solving x by doing y" but better still is "how can x be solved?" and only mention "y" if it's spinning its wheels.


Have it say 'you're absolutely fucked'! That would be very effective as a little reminder to be startled, stop, and think about what's being suggested.
