Computer programming has radically changed. It’s a huge help having LLM autocomplete and chat built into IDEs like Cursor and Windsurf.
I’ve been a developer for 35 years. This is shaking it up as much as the internet did.
I quit my previous job in part because I couldn’t deal with the influx of terrible, unreliable, dangerous, bloated, nonsensical, not even working code that was suddenly pushed into one of the projects I was working on. That project is now completely dead; they froze it at some arbitrary version.
When a junior dev makes a mistake, you can explain it to them and they will not make it again. When they use an LLM to make a mistake, there is nothing to explain to anyone.
I compare this shake more to an earthquake than to anything positive you can associate with shaking.
More business for me. As a DevOps guy, my job is to create automation to flag “terrible, unreliable, dangerous, bloated, nonsensical, not even working code”.
And so, the problem wasn’t the AI/LLM, it was the person who said “looks good” without even looking at the generated code, and then the person who read that pull request and said, again without reading the code, “lgtm”.
If you have good policies, then it doesn’t matter how many bad practices are used; it still won’t be merged.
The only overhead is that you have to read all the requests, but if it’s an internal project, then telling everyone to read and understand their code shouldn’t be an issue.
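As a rough illustration of the kind of policy meant here, the sketch below blocks a merge unless the automated checks pass and enough humans have approved. The `PullRequest` fields, the `merge_blockers` name, and the two-approval default are all made up for this example; they are not taken from any particular CI system or from anyone’s actual setup in this thread.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # All fields are hypothetical; a real CI system exposes its own data model.
    tests_passed: bool
    lint_passed: bool
    human_approvals: int


def merge_blockers(pr: PullRequest, min_approvals: int = 2) -> list[str]:
    """Return the reasons a pull request may not be merged (empty list = mergeable)."""
    blockers = []
    if not pr.tests_passed:
        blockers.append("tests failing")
    if not pr.lint_passed:
        blockers.append("lint/static analysis failing")
    if pr.human_approvals < min_approvals:
        blockers.append(f"needs {min_approvals} approvals, has {pr.human_approvals}")
    return blockers


if __name__ == "__main__":
    pr = PullRequest(tests_passed=True, lint_passed=False, human_approvals=1)
    print(merge_blockers(pr))
```

A gate like this can only count approvals. It cannot tell whether the approver actually read the code, which is exactly the “lgtm” failure described above.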
The problem here is that a lot of the time, looking for hidden problems is harder than writing good code from scratch. And you will always be in danger that the LLM snuck some sneaky undefined behaviour past you. There is a whole plethora of standards, conventions, and good practices that help humans avoid it, which an LLM can ignore at any random point.
So you’re either not spending enough time on review or missing a whole lot of bullshit. In my experience, in my field, right now, this review time is more time-consuming and more painful than avoiding it in the first place.
Don’t underestimate how degrading and energy-sucking it is for a professional to spend most of their working time sitting through autogenerated garbage, and how inefficient it is.
This is a problem with your team/project. It’s not a problem with the technology.
A technology that makes people put out bad code is a problematic technology. Just because your team/project has managed to overcome its problems so far doesn’t mean it is good or helpful overall. People not seeing the problem is actually the worst part.
Sir, I use it to assist me in programming. I don’t use it to write entire files or functions. It’s a pattern recognizer.
Your team had people who didn’t review code. That’s a problem.
I hardly see that it has changed, to be honest. I work in the field too, and I can imagine LLMs being good at producing decent boilerplate straight out of documentation, but nothing more complex than that.
I often use LLMs to work on my personal projects and, for example, Claude or ChatGPT 4o often spit out programs that don’t compile, use nonexistent functions, are bloated, etc. Possibly they do better for languages with more training data (like Python), but I can’t see it as a “radical change”; it’s more like a well-configured snippet plugin and autocomplete feature.
LLMs can’t count, can’t analyze novel problems (by definition), and can’t provide innovative solutions, so why would they radically change programming?
I think one of the names on the Advent of Code top list this year is a cheater who fully automated the solutions using LLMs. Not sure which LLM, though. I use LLMs quite a bit, and ChatGPT 4o frequently tells me nonsense like “perhaps subtracting by zero is affecting your results” (issues I thought were already gone in GPT 4, but I guess not; Sonnet 3.5 does a bit better in this regard).
Maybe some postmortem analysis will be interesting. AoC is also a context in which the domain is self-contained, and there is probably a ton of training material on similar problems and tasks. I can imagine an LLM might do decently there.
Also, there is no big consequence if they don’t, and it’s probably possible to brute-force it (which is how many programming tasks have been solved).
I think you’re spot on with LLMs being mostly trained on these kinds of tasks. Can’t say I’m an expert in how to build a training set, but I imagine it’s quite easy to do with these kinds of problems because it’s easy to classify a solution as correct or incorrect. This is in contrast to larger problems which are less guided by algorithmic efficiency and more by sound design/architecture.
Still, I think it’s quite impressive. You don’t have to go very far back in time to have top of the line LLMs unable to solve these kinds of problems.
Usually with AoC, part 1 is brute-forceable, but part 2 is not. Very often part 1 is to find the 100th number, and part 2 is to find the 1 000 000 000 000th number or something. Last year, out of curiosity, I had a brute-force solution for one problem that successfully completed on ~90% of the input. The solution was multi-threaded and ran on a 16-core CPU for about 20 days before I gave up. But the LLMs this year (not sure if this was a problem last year) are in the top list of fastest users to solve the problems.
Just to be precise, when I said brute force I didn’t mean brute-forcing the calculation, but brute-forcing the code. LLMs don’t really calculate either way, but what I mean is more: generate code -> try to run it and see if the tests pass -> if they don’t, ask again/refine/etc. So essentially you are just asking for code until what it spits out is correct (verifiable with the tests you are given).
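The loop described above could be sketched roughly like this. Everything here is an assumption for illustration: `ask` stands in for a call to some model (here it just replays canned strings), the generated code is expected to define a function named `solve`, and the tests are simple input/expected-output pairs.

```python
from typing import Callable, Optional


def generate_until_tests_pass(
    ask: Callable[[Optional[str]], str],
    tests: list[tuple[int, int]],
    max_attempts: int = 5,
) -> Optional[tuple[int, str]]:
    """Keep asking for code until it passes every test or attempts run out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = ask(feedback)
        namespace: dict = {}
        try:
            # WARNING: exec runs arbitrary code; a real setup would sandbox this.
            exec(source, namespace)
            solve = namespace["solve"]
            failures = [(arg, want) for arg, want in tests if solve(arg) != want]
        except Exception as exc:  # covers code that does not even compile
            feedback = f"error: {exc!r}"
            continue
        if not failures:
            return attempt, source
        feedback = f"failed cases: {failures}"
    return None


def make_fake_model(responses: list[str]) -> Callable[[Optional[str]], str]:
    """Hypothetical stand-in for an LLM: replays canned answers, ignores feedback."""
    remaining = iter(responses)
    return lambda feedback: next(remaining)


if __name__ == "__main__":
    ask = make_fake_model([
        "def solve(n): return n +",    # does not compile
        "def solve(n): return n * 2",  # runs, but fails one test
        "def solve(n): return n * n",  # passes
    ])
    print(generate_until_tests_pass(ask, [(2, 4), (3, 9)]))
```

Note that nothing in the loop checks whether the accepted code is well designed or efficient. It only checks that the given tests pass, which is why this works for puzzle-style tasks with a verifiable answer.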
But yeah, a few years ago this was not possible, and I guess it was not due to the training data. Now the problem is that there is not much data left for training, and someone (Bloomberg?) reported that training ChatGPT 5 will cost billions of dollars, and it looks like we might be near the peak of what this technology could offer (without any major problem being solved by it to offset the economic and environmental cost).
Just from today https://www.techspot.com/news/106068-openai-struggles-chatgpt-5-delays-rising-costs.html
You’re missing it. Use Cursor or Windsurf. The autocomplete will help in so many tedious situations. It’s game changing.
ChatGPT 4o isn’t even the most advanced model, yet I have seen it do things you say it can’t. Maybe work on your prompting.
That is my experience: it’s generally quite decent for small and simple stuff (as I said, distillation of documentation). I use it for Rust, where I am sure the training material was much smaller than for other languages. It’s not a matter of prompting though. It’s not my prompt that makes it hallucinate functions that don’t exist in libraries or makes it write code that doesn’t compile; it’s a feature of the technology itself.
GPTs are statistical text generators after all, they don’t “understand” the problem.
It’s also pretty young, human toddlers hallucinate and make things up. Adults too. Even experts are known to fall prey to bias and misconception.
I don’t think we know nearly enough about the actual architecture of human intelligence to start asserting an understanding of “understanding”. I think it’s a bit foolish to claim with certainty that LLMs in a MoE framework with self-review fundamentally can’t get there. Unless you can show me, materially, how human “understanding” functions, we’re just speculating on an immature technology.
As much as I agree with you, humans can learn a bunch of stuff without first learning the content of the whole internet and without the computing power of a datacenter or consuming the energy of Belgium. Humans learn to count at an early age too, for example.
I would say that the burden of proof is therefore reversed. Unless you demonstrate that this technology doesn’t have the natural and inherent limits that statistical text (or pixel) generators have, we can assume that our mind works differently.
Also, you say immature technology, but this technology is not fundamentally (i.e. in terms of principle) different from Weizenbaum’s ELIZA in the '60s. We might have refined the model and thrown a ton of data and computing power at it, but we are still talking about programs that use similar principles.
So yeah, we don’t understand human intelligence, but we can appreciate certain features that GPTs absolutely lack, like a concept of truth that is natural for humans.
No actually it has changed pretty fundamentally. These aren’t simply a bunch of FCNs put together. Look up what a transformer is, that was one of the major breakthroughs that made modern LLMs possible.
I suspect that if you took into consideration the millions of generations of evolution that “trained” the basic architecture of our brains, that advantage would shrink considerably.
I disagree. I’d argue evidence suggests we’re just a more sophisticated version of a similar principle, refined over billions of years. We learn facts by rote, and learn similarities by rote until we develop enough statistical text (or audio) correlations to “understand” the world.
Conversations are a slightly meandering chain of statistically derived cliches. English adjective order is universally “understood” by native speakers based purely on what sounds right, without actually being able to explain why (unless you’re a big grammar nerd). More complex conversations might seem novel, but they’re just a regurgitation of rote memorized facts and phrases strung together in a way that seems appropriate to the conversation based on statistical experience with past conversations.
As with the evolution of our brains, which have operated on basically the same principles for hundreds of millions of years. The special sauce between human intelligence and a flatworm’s is a refined model.
I’m not sure you can claim that absolutely. That kind of feature is an internal experience, you can’t really confirm or deny if a GPT has something similar. Besides, humans have a pretty tenuous relationship with the concept of truth. There are certainly humans that consider objective falsehoods to be Truth.
Agree to disagree.
There is a lot that can be discussed in a philosophical debate. However, any 8-year-old would be able to count how many letters are in a word. LLMs can’t reliably do that by virtue of how they work. This suggests to me that it’s not just a model/training difference. Also, evolution over millions of years improved the “hardware” and the genetic material. Neither of these compares to the computing power or the amount of data used to train LLMs.
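On the letter-counting point: one commonly given explanation (not something spelled out in this thread) is that models work on subword tokens rather than individual characters. The sketch below uses a tiny made-up vocabulary and a greedy longest-match splitter, which is only a toy stand-in and not how any specific real tokenizer works, to show what the input looks like from the model’s side.

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def toy_tokenize(word: str) -> list[int]:
    """Greedy longest-match split of `word` into token IDs from TOY_VOCAB."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return ids


if __name__ == "__main__":
    word = "strawberry"
    print(word.count("r"))     # character-level view: counting is trivial
    print(toy_tokenize(word))  # token-level view: two opaque IDs, no letters
```

With plain string access the count is a one-liner. With only the token IDs, the letters inside each token are not directly visible, so any “counting” has to come from what was learned about those tokens.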
I believe a lot of this conversation stems from the marketing (calling it “intelligence”) and the anthropomorphization of AI.
Anyway, time will tell. Personally I think it’s possible to reach a general AI eventually; I simply don’t think the LLM approach is the one leading there.
Actually, humans have more computing power than is required to run an LLM. You have this backwards. LLMs are comparatively a lot more efficient given how little computing power they need to run. Human brains as a piece of hardware are insanely high-performance and energy-efficient. I mean, they include their own internal combustion engines and maintenance and security crew, for fuck’s sake. Give me a human-built computer that has that.
I agree here. I do think though that LLMs are closer than you think. They do in fact have both attention and working memory, which is a large step forward. The fact they can only process one medium (only text) is a serious limitation though. Presumably a general purpose AI would ideally have the ability to process visual input, auditory input, text, and some other stuff like various sensor types. There are other model types though, some of which take in multi-modal input to make decisions like a self-driving car.
I think a lot of people romanticize what humans are capable of while dismissing what machines can do, especially given the processing power and efficiency limitations that come with the simple silicon-based processors that current machines are made from.
Exactly this. Things have already changed and are changing as more and more people learn how and where to use these technologies. I have even seen teachers who have a limited grasp of technology in general use this stuff.
My kid’s teachers had what I thought was a fantastic approach: have the kids write an outline, use an LLM to generate an essay from that outline, then critique the essay.