Very different. A scantron machine is deterministic and non-chaotic.
In addition to being non-deterministic, LLMs can produce vastly different output from very slightly different input.
That’s ignoring how vulnerable LLMs are to prompt injection, and if this becomes common enough that exams aren’t thoroughly vetted by humans, I expect prompt attacks to become common.
Also, if this is about avoiding in-person exams, what prevents students from just letting their AI talk to the test AI?
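To make the variability concrete, here's a minimal sketch (not the article's setup; the model name, rubric, and answers are made up, and it assumes the OpenAI Python client with an API key in the environment): grading two trivially different phrasings of the same answer at non-zero sampling temperature gives no guarantee the scores match.

```python
# Minimal sketch, NOT the article's setup: grade two trivially different
# phrasings of the same answer. Assumes the OpenAI Python client
# (`pip install openai`) and OPENAI_API_KEY in the environment; the model
# name and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

RUBRIC = "Grade the student's answer from 0 to 10. Reply with a single integer."
QUESTION = "Explain why the sky appears blue."

def grade(answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model name
        temperature=1.0,       # non-zero temperature: output is sampled
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {QUESTION}\nAnswer: {answer}"},
        ],
    )
    return resp.choices[0].message.content

# Two wordings of the same correct idea; nothing forces the grades to agree.
print(grade("Rayleigh scattering affects short wavelengths the most."))
print(grade("Short wavelengths are scattered the most (Rayleigh scattering)."))
```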
I saw this piece as the start of an experiment, and the use of a "council of AI", as they put it, to average out the variability sounds like a decent path to standardization to me (prompt injection would not be impossible, but getting something past all the steps sounds like a pretty tough challenge).
They mention getting 100% agreement between the LLMs on some questions and lower rates on others, so if an exam were composed only of questions where there is near-100% convergence, we'd be pretty close to a stable state.
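Something like this rough sketch of that filtering step, with hypothetical canned grades standing in for real model calls: keep only the questions where the council's scores fall within a small spread, drop the rest as unstable.

```python
# Rough sketch of the convergence filter, with hypothetical canned grades
# standing in for real model calls (scores on a 0-10 scale).
from statistics import mean

grades = {
    "q1": {"model_a": 8, "model_b": 8, "model_c": 8},  # full agreement
    "q2": {"model_a": 5, "model_b": 9, "model_c": 7},  # wide disagreement
    "q3": {"model_a": 6, "model_b": 7, "model_c": 6},  # near agreement
}

MAX_SPREAD = 1  # tolerated gap between the highest and lowest grade

def converged(scores):
    values = list(scores.values())
    return max(values) - min(values) <= MAX_SPREAD

kept = {q: mean(s.values()) for q, s in grades.items() if converged(s)}
dropped = [q for q in grades if q not in kept]

print("scored questions:", kept)        # q1 and q3 survive
print("dropped as unstable:", dropped)  # ['q2']
```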
I agree it would be reassuring to have a human somewhere in the loop, or perhaps to allow the students to appeal the evaluation (at cost?) if there is evidence of a disconnect between the exam and the other criteria. But depending on how the questions and format are tweaked, we could IMHO end up with something reliable for very basic assessments.
PS:
> Also, if this is about avoiding in-person exams, what prevents students from just letting their AI talk to the test AI?
Nothing indeed. The arms race hasn't started here, and will keep going IMO.
So the whole thing is a complete waste of time then as an evaluation exercise.
>council of AIs
This only works if the errors and idiosyncrasies of different models are independent, which isn’t likely to be the case.
>100% agreement
When different models independently graded tests, 0% of grades matched exactly and the average disagreement was huge.
They only reached convergence on some questions when they allowed the AIs to deliberate. This is essentially just context poisoning.
One model incorrectly grading a question will make the other models more likely to grade that question incorrectly.
If you don’t let models see each other’s assessments, all it takes is one person writing an answer in a slightly different way, causing disagreement among the models, to vastly alter the overall scores by tossing out a question.
This is not even close to something you want to use to make consequential decisions.
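A toy simulation of the independence point above (invented numbers, nothing to do with the article's actual setup): three graders each mis-grade a question with some probability, and the majority vote only helps much when their errors are independent. Once they share a failure mode, e.g. because they deliberated and saw each other's output, the majority inherits it.

```python
# Toy simulation, invented numbers: three graders each mis-grade a question
# with probability 0.2. Compare the chance that the MAJORITY verdict is wrong
# when their errors are independent versus when they share a failure mode.
import random

random.seed(0)
TRIALS = 100_000
P_ERR = 0.2     # per-grader error rate
P_SHARED = 0.7  # chance the outcome is driven by one shared coin flip

def majority_wrong_rate(correlated: bool) -> float:
    wrong = 0
    for _ in range(TRIALS):
        if correlated and random.random() < P_SHARED:
            shared = random.random() < P_ERR
            errors = [shared, shared, shared]  # everyone fails together
        else:
            errors = [random.random() < P_ERR for _ in range(3)]
        if sum(errors) >= 2:                   # majority of 3 is wrong
            wrong += 1
    return wrong / TRIALS

print("independent errors:", majority_wrong_rate(False))  # ~0.10
print("correlated errors :", majority_wrong_rate(True))   # ~0.17
```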
The appeal of a technological solution to a human problem is something we have fallen for too many times these last few decades.
Humans are incredibly good at solving problems, but while one person is solving 'how do we prevent students from cheating?', a student is thinking 'how do I bypass this limitation preventing me from cheating?'. And when these problems are digital and scalable, it only takes one student to solve that problem for every other student to have access to the solution.
Distance education is a tiny percentage of higher education though. Online classes at a local university are more common, but you can still bring the students in for proctored exams.
Even for distance education though, proctored testing centers have been around longer than the internet.
This is fine for us who've been building code by hand for many years before the advent of LLMs but it's definitely going to be a problem going forward.
Every useful CRUD app becomes its own special snowflake with time and users.
Now if your CRUD app never gets any users, sure, it stays generic. But we’ve had low-code solutions that solve this problem for decades.
LLMs are good at stuff that probably should have been low code in the first place, but couldn’t be for reasons. That’s useful, but it comes with a ton of trade-offs. And these kinds of solutions cover a lot less ground than you’d think.
I'm old enough to remember the "OMG low-code is going to take our jeeeerbbs!" panic :D
Like LLMs, they took away a _very_ specific segment of software. Zapier, n8n, NodeRED etc. do some things in a way that bespoke apps can't - but they also hit a massive hard wall where you either need to do some really janky shit or just break out Actual Code to move forward.
Depending on other people to maintain backward compatibility so that you can keep coding like it’s 2025 is its own problematic dependency.
You could certainly do it but it would be limiting. Imagine that you had a model trained on examples from before 2013 and your boss wants you to take over maintenance for a React app.
You're all referencing the strange idea of a world where no open-weight coding models would be trained in the future. Even in a world where VC spending vanished completely, coding models are such a valuable utility that I'm sure, at the very least, companies/individuals would crowdsource them on a recurring basis, keeping them up to date.
The value of this technology has been established, it's not leaving anytime soon.
I think FAANG and the like would probably crowdsource it, given that (according to the hypothesis presented) they would only have to do it every few years, and ostensibly they are realizing improved developer productivity from these models.
I don’t think the incentive to open source is there for $200 million LLM models the same way it is for frameworks like React.
And for closed source LLMs, I’ve yet to see any verifiable metrics that indicate that “productivity” increases are having any external impact—looking at new products released, new games on Steam, new startups founded etc…
Certainly not enough to justify bearing the full cost of training and infrastructure.
2013 was pre-LLM. If devs continue relying on LLMs and their training were to stop (which I would find unlikely), the tools around the LLMs would still continue to evolve, and new language features would get less attention, being used only by people who don't like to use LLMs. Then it would be a popularity race between new languages (and features) and LLMs steering 'old' programming languages and APIs. It's not always the best technology that wins; often it's the most popular one. You know what happened during the browser wars.
Jai hasn’t even had the whole array-of-structs to struct-of-arrays thing in years.
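For anyone who hasn't followed Jai: as I understand it, the feature being referenced let you switch a data structure between the two layouts below with an annotation. This is plain Python rather than Jai, purely to illustrate the array-of-structs vs struct-of-arrays distinction.

```python
# Plain Python, not Jai -- just illustrating the array-of-structs vs
# struct-of-arrays layouts the feature toggled between.
from dataclasses import dataclass

# Array of structs: one object per particle, fields interleaved.
@dataclass
class Particle:
    x: float
    y: float
    mass: float

aos = [Particle(1.0, 2.0, 5.0), Particle(3.0, 4.0, 1.0)]

# Struct of arrays: one flat sequence per field, which tends to be
# friendlier to caches/SIMD when a loop only touches one field.
soa = {
    "x":    [1.0, 3.0],
    "y":    [2.0, 4.0],
    "mass": [5.0, 1.0],
}

assert sum(p.mass for p in aos) == sum(soa["mass"])
```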
Also speeding up compilation time really does require a new language or at least a new compiler.
And why would you “go after” any language? If you don’t like it, don’t use it. The only thing going after it is going to do is drive up the engagement metrics and make it more popular.
>And why would you “go after” any language? If you don’t like it, don’t use it.
We are on a discussion forum. One of the common use cases of a discussion forum is criticism and debate. Yes, we could all simply use the tools we want, and not use the tools we don't, and not waste time expressing an opinion either way, but again this is a discussion forum.
And it's not as if I posted "Jai delenda est" here, I think my opinions are mild compared to what people here have to say about javascript, or C++ or PHP or any other language. I just don't think that a gamedev specific language is a good idea, compared to implementing libraries and frameworks in an existing language. I don't like the bespoke languages used by frameworks like Godot or GameMaker either.
Inflammable doesn’t mean not flammable. It means able to be inflamed. Language changes over time but this word is particularly problematic, so I’d avoid it to avoid confusion.
I can’t find anything about Casey Muratori or Jon Blow discussing Yandere Simulator.
But even if they did comment on the quality of the code, when they talk about simplicity they are generally talking about avoiding unnecessary abstractions (usually OOP abstractions), not any of what the OP is talking about.
This isn’t about free market vs single payer healthcare. These kids are from poor countries. Unless you’re arguing for rich countries to offer literal worldwide healthcare.