It's a lot faster compared to new developers, who still cost orders of magnitude more from day one. It's not cost effective to hand every task to someone senior. I still have juniors on teams because in the long term we still need actual people with a path to becoming senior devs, but in purely financial terms they are now a drain.
You can feel free not to believe it, as I have no plans to open up my tooling anytime soon - though partly because I'm considering turning it into a service. In the meantime these tools are significantly improving the margins for my consulting, and the velocity increases steadily: every time we run into a problem, we have the tooling revise its own system prompt or add checks to the harness it runs under so it avoids that problem next time.
A lot of it is very simple. E.g. a lot of these tools can produce broken edits. They'll usually realise and fix them, but adding an edit tool that forces the code through syntax checks and linters, for example, saved a lot of pain. As does forcing regular test and coverage runs, not just on builds.
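To give a rough idea of the shape of that check (a simplified sketch, not my actual harness - the rollback behaviour and exact commands are just illustrative):

    require "open3"

    # Sketch of a gated edit step: apply the change, run a syntax check and
    # the linter, and roll the file back if either fails. The error text is
    # what gets returned to the model so it can fix its own edit.
    def checked_edit(path, new_contents)
      original = File.read(path)
      File.write(path, new_contents)

      _out, err, status = Open3.capture3("ruby", "-c", path)
      unless status.success?
        File.write(path, original)
        return "Syntax error:\n#{err}"
      end

      lint_out, _lint_err, lint_status = Open3.capture3("rubocop", path)
      unless lint_status.success?
        File.write(path, original)
        return "Lint errors:\n#{lint_out}"
      end

      nil # nil means the edit was accepted
    end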
For one of my projects I now let this tooling edit without asking permission, and just answer yes/no to whether it can commit once it's ready. If no, I'll tell it why and review again when it thinks it's fixed things, but a majority of commit requests are now accepted on the first try.
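The loop itself is nothing fancy - roughly this, where `assistant` is a stand-in for the actual client rather than any real API:

    # Heavily simplified version of the commit gate: the assistant edits the
    # working tree freely, but a human answers yes/no before anything is
    # committed. `assistant` is a placeholder object, not a real library call.
    def commit_gate(assistant)
      loop do
        system("git", "add", "-A")
        system("git", "--no-pager", "diff", "--cached", "--stat")
        print "Commit? (y/n) "
        if $stdin.gets.to_s.strip.downcase == "y"
          system("git", "commit", "-m", assistant.commit_message)
          return
        end
        print "Why not? "
        feedback = $stdin.gets.to_s.strip
        assistant.revise(feedback) # model reworks the change, then we review again
      end
    end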
For the same project I'm now also experimenting with asking the assistant to come up with a todo list of enhancements for it based on a high-level goal, then work through it, with me just giving minor comments on the proposed list.
I'm vaguely tempted to let this assistant reload its own modified code when tests pass and leave it to work on itself for a while and see what comes of it. But I'd need to sandbox it first. It's already tried (and was stopped by a permissions check) to figure out how to restart itself to enable new functionality it had written, so it "understands" when it is working on itself.
But, by all means, you can choose to just treat this as fiction if it makes you feel better.
No, I am not disputing whatever productivity gains you seem to be getting. I was just curious whether LLMs feeding data into each other can work that well, knowing how long it took OpenAI to make ChatGPT properly count the number of "R"s in the word "strawberry". There is this effect called "Habsburg AI". I reckon the syntax-check and linting stuff is straightforward, as it adds a deterministic element, but what do you do about the trickier stuff like dreamt-up functions and code packages? Unsafe practices like exposing sensitive data in cleartext, Linux commands which are downright the opposite of what was prompted, etc.? That comes up fairly often and I am not sure that LLMs are going to self-correct here without human input.
It doesn't stop them from making stupid mistakes. It does reduce the amount of time I have to spend dealing with the stupid mistakes they know how to fix once the problem is pointed out to them, so that what I end up reviewing is tighter diffs of cleaner code.
E.g. a real example: the tooling I mentioned at one point early on made the correct functional change, but it's written in Ruby, and Ruby allows defining the same method multiple times in a class - the later definition just silently overrides the earlier one, so the old version was left in place. This would of course be a compilation error in most other languages. It's a weakness of using Ruby with a careless (or mindless) developer...
But RuboCop - a linter - will catch it. So forcing all changes through RuboCop and just returning the errors to the LLM made it recognise the mistake and delete the old method.
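For illustration, the mistake had this shape (not the actual code; the method names are made up):

    class Invoice
      def total
        line_items.sum(&:amount)
      end

      # ...further down, the tool added a fresh definition instead of editing
      # the one above. Ruby happily accepts this, and the later one wins.
      def total
        line_items.sum(&:amount) + tax
      end
    end

    # RuboCop reports this as a Lint/DuplicateMethods offence, which is the
    # error text that gets handed back to the model.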
It lowers the cognitive load of the review. Instead of having to wade through and resolve a lot of cruft and make sense of unusually structured code, you can focus on the actual specific changes and subject those to more scrutiny.
And then my plan is to experiment with more semantic checks in the same style as what RuboCop does, but less prescriptive, of the type "maybe you should pay extra attention here, and explain why this is correct/safe" etc. An example might be to trigger this for any change that involves reading a key, password field, or card number, whether or not there is actually a problem with it, and have it both prompt the LLM to "look twice" and flag the area as one to pay extra attention to in a human review.
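Something along these lines, purely as a sketch of the idea - the patterns and the wording of the notes are made up for illustration:

    # Sketch of a "look twice" trigger: if a diff touches anything that looks
    # like credentials or card data, attach an extra instruction for the model
    # and a flag for the human reviewer. The patterns are illustrative only.
    SENSITIVE = /\b(password|passwd|secret|api_key|card_number|cvv)\b/i

    def review_flags(diff_text)
      return [] unless diff_text.match?(SENSITIVE)

      [
        { audience: :llm,
          note: "This change touches credential or card data. Explain why it " \
                "is safe: no plaintext logging, no cleartext storage." },
        { audience: :human,
          note: "Sensitive-data pattern matched; pay extra attention here." }
      ]
    end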
It doesn't need to be perfect, it just needs to provide enough of a harness to make it easier for the humans in the loop to spot the remaining issues.
Right, so you understand that any dev who already uses, for example, GitHub Copilot with various code-syntax extensions already achieves whatever it is that your new service is delivering? I'd spare myself the effort if I were you.
It didn't start with the intent of being a service; I started on it because there were a number of things that Copilot, or tools like Claude Code, don't do well enough, and that annoyed me. A few hours of work was enough to get it to the point where it's now my primary coding assistant: it works better for me for my stack, and I can evolve it further to solve the specific problems I need solved.
So, no, I'll keep doing this because doing this is already saving me effort for my other projects.