Interesting. What's that “scaffold”? A sort of unit test framework for proofs?
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
Sometimes just having the time/compute to explore the available space with known knowledge is enough to produce something unique.
We have at least 5 senses, our thoughts, feelings, hormonal fluctuations, sleep and continuous analog exposure to all of these things 24/7. It's vastly different from how inputs are fed into an LLM.
On top of that we have millions of years of evolution toward processing this vast array of analog inputs.
My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.
I suppose the other side of it is that if you add what the model has figured out to the training set, it will always know it.
It's not to downplay this, but it's unclear what "novel" means here or what you think the implications are.
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
Hoping that won't be the case with AI but we may need some major societal transformations to prevent it.
I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems is opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.
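For scale, the computation those hundreds of tokens go toward is a one-liner in ordinary code (a toy illustration of the cost gap, obviously not a claim about how a model works internally):

```python
# Counting letters is a trivial deterministic computation —
# no "reasoning" tokens required.
word = "strawberry"
count = word.count("r")
print(count)  # → 3
```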
1. Knowing how to state the problem. I.e., go from the vague problem of "I don't like this, but I do like this", to the more specific problem of "I desire property A". In math a lot of open problems are already precisely stated, but then the user has to do the work of _understanding_ what the precise statement is.
2. Verifying that the proposed solution actually is a full solution.
This math problem actually illustrates them both really well to me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work to be done. Even if I were very familiar with the problem space, verifying the solution requires work -- manually looking at it, writing it up in Coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success".
1. LLMs aren't "efficient", they seem to be as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated things.
2. LLMs aren't "efficient", they use the same amount of compute for each token but sometimes all that compute is making an interesting decision about which token is the next one and sometimes there's really only one follow up to the phrase "and sometimes there's really only" and that compute is clearly unnecessary.
3. A (theoretical) efficient LLM still needs to emit tokens to tell the tools to do the obviously right things like "copy this giant file nearly verbatim except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" and not "compute" that's never going to show up.
Until we get reasonably efficient LLMs that don't waste compute quite so freely I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.
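The "obviously right" mechanical edit in point 3 above is the kind of thing a conventional tool does in one line. A sketch, reusing the hypothetical `if foo` → `for foo in foo` rewrite from the comment:

```python
import re

source = """\
if foo:
    process(foo)
if foo:
    log(foo)
"""

# A purely mechanical rewrite: replace every `if foo` with `for foo in foo`.
# No per-token "thinking" is needed; the transform is deterministic.
rewritten = re.sub(r"\bif foo\b", "for foo in foo", source)
print(rewritten)
```

The contrast is the point: an LLM emitting the rewritten file token by token pays full inference cost for every character of output that this one regex produces for free.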
Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.
Even as context sizes get larger, this will likely be relevant. Especially since AI providers may jack up the price per token at any time.
In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.
Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.
It's after all still a remix machine; it can only interpolate between that which already exists. Which is good for a lot of things, considering everything is a remix, but it can't do truly new tasks.
The number of tokens required to get to an output is a function of the sequence of inputs/prompts, and how a model was trained.
You have LLMs quite capable of accomplishing complex software engineering work that struggle with translating valid text from English into some other languages. The translations can be improved with additional prompting, but that doesn't mean the problem is more challenging.
They are separate dimensions. There are problems that don't require any data, just "thinking" (many parts of math sit here), and there are others where data is the significant part (e.g. some simple causality for which we have a bunch of data).
Certain problems are in between the two (probably a React refactor sits there). So no, tokens are probably not a good proxy for complexity; data-heavy problems will trivially outgrow the former category.
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...
[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess that (1) AI solving frontier math problems and (2) AI posing interesting/relevant math problems, together, would be an "oh shit" moment. Because that would be true PhD-level research.
It's pretty much how all the hard problems are solved by AI from my experience.
AI, by comparison, thinks of every possible scientific method and tries them all. Not saying that humans never do this as well, but for us it's mostly reserved for when we just throw mud at a wall and see what sticks.
If we get to any sort of confidence that it will work, that confidence is based on building a history of it, or of things related to "it", working consistently over time, out of innumerable other efforts where other "it"s did not work.
Also
> humans are a lot better at (...)
That's maybe true in 2026, but it's hard to make statements about "AI" in a field that is advancing so quickly. For most of 2025, for example, AI doing math like this wasn't even possible.
As for advances where there is a hypothesis, it rests on the shoulders of those who've come before. You know from observations that putting carbon in iron makes it stronger, and then someone else comes along with a theory of atoms and molecules. You might apply that to figuring out why steel is stronger than iron, and your student takes that and invents a new superalloy with improvements to your model. Remixing is a fundamental part of innovation, because it often teaches you something new. We aren't just alchemying things out of nothing.
Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.
A basic AI chat response also doesn't first discard all other possible responses.
The artist drew 10 pencil sketches and said "hmm I think this one works the best" and finished the painting based on it.
I said he didn't one shot it and therefore he has no ability to paint, and refused to pay him.
It remains to be seen whether this is genuinely intelligence or an infinite monkeys at infinite typewriters situation. And I'm not sure why this specific example is worthy enough to sway people in one direction or another.
"Even if every proton in the observable universe (which is estimated at roughly 10^80) were a monkey with a typewriter, typing from the Big Bang until the end of the universe (when protons might no longer exist), they would still need a far greater amount of time – more than three hundred and sixty thousand orders of magnitude longer – to have even a 1 in 10^500 chance of success. To put it another way, for a one in a trillion chance of success, there would need to be 10^360,641 observable universes made of protonic monkeys."
Often, things that have probability 1 in the infinite limit are, in practice, safe to assume to be 0.
So no. LLMs are not brute force dummies. We are seeing increasingly emergent behavior in frontier models.
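The arithmetic behind figures like the ones quoted above is easy to sanity-check in log space, since the probabilities are far too small for floating point. A rough sketch, assuming a ~130,000-character target text and a 50-key typewriter (generic infinite-monkey numbers, not the exact ones from the quote):

```python
import math

KEYS = 50          # keys on the typewriter
LENGTH = 130_000   # characters in the target text (roughly Hamlet-sized)

# P(one attempt types the text) = KEYS ** -LENGTH — far below float range,
# so work with base-10 logarithms instead.
log10_p = -LENGTH * math.log10(KEYS)
print(f"one attempt succeeds with probability ~10^{log10_p:.0f}")

# Be generous: 10^80 monkeys, 10^9 keystroke-attempts per second,
# 10^18 seconds of typing. That only adds 107 orders of magnitude.
log10_attempts = 80 + 9 + 18
print(f"expected successes: ~10^{log10_p + log10_attempts:.0f}")
```

The attempt budget closes about 107 of the roughly 220,000 orders of magnitude needed, which is why brute force is not a serious explanation for anything an LLM does.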
Woah! That was a leap. "We are seeing ... emergent behaviors" does not follow from "it's not brute force".
It is unsurprising that an LLM performs better than random! That's the whole point. It does not imply emergence.
What? Did you see one crying?
We start writing all those formulas etc., and if at some point we realise we went the wrong way, we start again from the beginning (or from some point we are sure about).
You can make that claim about anything: "The human isn't being creative when they write a novel, they're just summoning patterns at typing time".
AlphaGo taught itself that move, then recalled it later. That's the bar for human creativity and you're holding AlphaGo to a higher standard without realizing it.
AlphaGo didn't teach itself that move. The verifier taught AlphaGo that move. AlphaGo then recalled the same features during inference when faced with similar inputs.
Ok so it sounds like you want to give the rules of Go credit for that move, lol.
I don't really play Go but I play chess, and it seems to me that most of what humans consider creativity in GM level play comes not in prep (studying opening lines/training) but in novel lines in real games (at inference time?). But that creativity absolutely comes from recalling patterns, which is exactly what OP criticizes as not creative(?!)
I guess I'm just having trouble finding a way to move the goalpost away from artificial creativity that doesn't also move it away from human creativity?
No. AlphaGo developed a heuristic by playing itself repeatedly, the heuristic then noticed the quality of that move in the moment.
Heuristics are the core of intelligence in terms of discovering novelty, but this is accessible to LLMs in principle.
> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
An AI can probably do an 'okay' job at summarizing information for meta studies. But what it can't do is go "Hey that's a weird thing in the result that hints at some other vector for this thing we should look at." Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.
LLMs will NEVER be able to do that, because it doesn't exist. They're not going to discover and define a new chemical, or a new species of animal. They're not going to be able to describe and analyze a new way of folding proteins and what implications that has UNLESS you are basically constantly training the AI on random protein folds.
Kinda funny because that looked _very_ close to what my Opus 4.6 said yesterday when it was debugging compile errors for me. It did proceed to explore the other vector.
This is the crucial part of the comment. LLMs are not able to solve stuff that hasn't been solved in that exact or a very similar way already, because they are prediction machines trained on existing data. They are very able to spot outliers where they have been found by humans before, though, which is important, and is what you've been seeing.
This is very common already in AI.
Just look at the internal reasoning of any high thinking model, the trace is full of those chains of thought.
I mean, TFA literally claims that an AI has solved an open Frontier Math problem, described as "A collection of unsolved mathematics problems that have resisted serious attempts by professional mathematicians. AI solutions would meaningfully advance the state of human mathematical knowledge."
That is, if true, it reasoned out a proof that does not exist in its training data.
- Edit: I can't reply, probably because the comment thread isn't allowed to go too deep, but this is a good argument. In my mind the argument isn't that coding is harder than math, but that the problems had resisted solution by human researchers.
So really this is no different from generating any python program. There are also many examples of combinatoric construction in python training sets.
It's still a nice result, but it's not quite the breakthrough it's made out to be. I think that people somehow see math as a "harder" domain, and are therefore attributing more value to this. But this is a quite simple program in the end.
I'm curious though, how many novel Math proofs are not close enough to something in the prior art? My understanding is that all new proofs are compositions and/or extensions of existing proofs, and based on reading pop-sci articles, the big breakthroughs come from combining techniques that are counter-intuitive and/or others did not think of. So roughly how often is the contribution of a proof considered "incremental" vs "significant"?
Remember, the basis of these models is unsupervised training, which, at sufficient scale, gives them the ability to detect pattern anomalies out of context.
For example, LLMs have struggled with generalized abstract problem solving, such as "mystery blocks world" that classical AI planners dating back 20+ years or more are better at solving. Well, that's rapidly changing: https://arxiv.org/html/2511.09378v1
That is, even if there are cool things that LLMs now make more affordable, the level of bullshit marketing attached to them is also very high, which makes it far harder to build a noise filter.
Some human researchers are also remixers to some degree.
Can you imagine AI coming up with refraction & separation like Newton did?
That being said, I think this is a great question. Did Einstein and Newton use a qualitatively different process of thought when they made their discoveries? Or were they just exceedingly good at what most scientists do? I honestly don't know. But if LLMs reach super-human abilities in math and science but don't make qualitative leaps of insight, then that could suggest that the answer is 'yes.'
This is false.
Nobody is saying this means AI is superintelligence or largely creative but rather very smart people can use AI to do interesting things that are objectively useful. And that is cool in its own way.
Please reproduce this string:
c62b64d6-8f1c-4e20-9105-55636998a458
This is a fresh UUIDv4 I just generated; it has not been seen before. And yet it will output it. Also, it's missing the point of the parent: it's about concepts and ideas merely being remixed. Similar to how many memes there are around this topic, like "create a fresh new character design of a fast hedgehog" where the output is just a copy of Sonic.[1]
That's what the parent is on about: if it requires new creativity not derivable from the learned corpus, then LLMs can't do it. Terence Tao had similar thoughts in a recent podcast.
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
My only claim is that precisely this is incorrect.
This is specious reasoning. If you look at each and every single realization attributed to "creativity", each and every single realization resulted from a source of inspiration where one or more traits were singled out to be remixed by the "creator". All ideas spawn from prior ideas and observations which are remixed. Even from analogues.
Please reproduce this string, reversed:
c62b64d6-8f1c-4e20-9105-55636998a458
It is trivial to get an LLM to produce new output, that's all I'm saying. It is strictly false that LLMs will only ever output character sequences that have been seen before; clearly they have learned something deeper than just that. I think there are examples of what you're looking for, but this isn't one.
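A quick way to see why this test works: reversing the string is a deterministic computation, yet the result is essentially guaranteed to be absent from any training corpus, because a v4 UUID carries about 122 bits of randomness (a sketch, using the UUID from the thread):

```python
s = "c62b64d6-8f1c-4e20-9105-55636998a458"
print(s[::-1])  # → 854a89963655-5019-02e4-c1f8-6d46b26c

# A v4 UUID has ~122 random bits. Even against a corpus of 10**15 distinct
# strings, the chance this string (or its reversal) appears verbatim is
# roughly 10**15 / 2**122 — on the order of 1e-22. A correct reversal
# therefore can't be plain recall of a memorized sequence.
print(10**15 / 2**122)
```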
LLMs can use data in their prompt. They can also use data in their context window. They can even augment their context with persisted data.
You can also roll out LLM agents, each one with their role and persona, and offload specialized tasks with their own prompts, context windows, and persisted data, and even tools to gather data themselves, which then provide their output to orchestrating LLM agents that can reuse this information as their own prompts.
This is perfectly composable. You can have a never-ending graph of specialized agents, too.
Dismissing features because "all of the data is in the prompt" completely misses the key traits of these systems.
New sentences, words, or whatever are entirely possible, and yes, repeating a string (especially if you prompt it) is entirely possible, and not surprising at all. But all of that comes from trained data, predicting the most probable next "syllable". It will never leave that realm, because it's not able to. It's like asking an Italian who has never learned or heard any other language to speak French. They can't.
> Write me a stanza in the style of "The Raven" about Dick Cheney on a first date with Queen Elizabeth I facilitated by a Time Travel Machine invented by Lin-Manuel Miranda
It outputted a group of characters that I can virtually guarantee you it has never seen before on its own.
All of its output is based on those things it has seen.
It's not "thinking." It's not "solving." It's simply stringing words together in a way that appears most likely.
ChatGPT cannot do math. It can only string together words and numbers in a way that can convince an outsider that it can do math.
It's a parlor trick, like Clever Hans [1]. A very impressive parlor trick that is very convincing to people who are not familiar with what it's doing, but a parlor trick nonetheless.
Right but it has to reason about what that next word should be. It has to model the problem and then consider ways to approach it.
When an LLM is "reasoning" it's just feeding its own output back into itself and giving it another go.
And by the way, I don't think it's surprising that so many people are being unreasonable on this issue; there is a lot at stake and its implications are transformative.
This is a good example of being confidently misinformed.
The best move is always a result of calculation. And the calculation can always go deeper or run on a stronger engine.
Even if it is, this sounds like "this submarine doesn't actually swim" reasoning.
It can produce outputs that resemble calculations.
It can prompt an agent to input some numbers into a separate program that will do calculations for it and then return them as a prompt.
Neither of these are calculations.
Virtually all output from people is based in things the person has experienced.
People aren't designed to objectively track each and every event or observation they come across. Thus it's harder to verify. But we only output what has been inputted to us before.
They are not capable of mathematics because mathematics and language are fundamentally separated from each other.
They can give you an answer that looks like a calculation, but they cannot perform a calculation. The most convincing of LLMs have even been programmed to recognize that they have been asked to perform a calculation and hand the task off to a calculator, and then receive the calculator's output as a prompt even.
But it is fundamentally impossible for an LLM to perform a calculation entirely on its own, the same way it is fundamentally impossible for an image recognition AI to suddenly write an essay or a calculator to generate a photo of a giraffe in space.
People like to think of "AI" as one thing but it's several things.
By your definition, humans can't perform calculation either. Only a calculator can.
You're also correlating "mathematics" and "calculation". Who cares about calculation, as you say, we have calculators to do that.
Mathematics is all just logical reasoning and exploration using language, just a very specific, dense, concise, and low-level language. But you can always take any mathematical formula and express it as "language"; it will just take far more "symbols".
This might be the worst take in this entire comment section. And I'm not even an overly hyped vibe coder, just someone who understands mathematics.
In either case, this "it's a language model" is a pretty dumb argument to make. You may want to reason about the fundamental architecture, but even that quickly breaks down. A sufficiently large neural network can execute many kinds of calculations. In "one shot" mode it can't be Turing complete, but in a weird technicality neither does your computer have an infinite tape. It just simply doesn't matter from a practical perspective, unless you actually go "out of bounds" during execution.
50T parameters give plenty of state space to do all kinds of calculations, and you really can't reason about it in a simplistic way like "this is just a DFA".
Let alone when you run it in a loop.
Either one. An LLM cannot solve 3+5 by adding 3 and 5. It can only "solve" 3+5 by knowing that within its training data, many people have written that 3+5=8, so it will produce 8 as an answer.
An LLM, similarly, cannot simulate a Turing machine. It can produce a text output that resembles a Turing machine based on others' descriptions of one, but it is not actually reading and writing bits to and from a tape.
This is why LLMs still struggle to tell you how many r's are in the word "strawberry". They can't count. They can't do calculations. They can only reproduce text based on having examined the human corpus's mathematical examples.
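For what it's worth, "performing a calculation" in the algorithmic sense the comment means looks like this — explicit digit-by-digit carries rather than lookup (a toy sketch to make the distinction concrete, not a claim about what happens inside a transformer):

```python
def add_decimal(a: str, b: str) -> str:
    """Grade-school addition on decimal digit strings, with explicit carries."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    # Walk from the least significant digit, carrying as we go.
    for da, db in zip(reversed(a), reversed(b)):
        carry, d = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_decimal("3", "5"))     # → 8
print(add_decimal("987", "58"))  # → 1045
```

Whether an LLM's learned circuits implement something like this procedure or merely approximate its outputs is exactly the point under dispute in this thread.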
You can already do this today with every frontier model. You can give it an image and have it write an essay from it. Both patches (parts of images) and text get turned into tokens in the same input space the LLM operates on.
I've heard this tired old take before. It's the same type of simplistic opinion as "AI can't write a symphony". It's a logical fallacy that relies on moving goalposts to impossible positions, to the point of losing perspective of what your average, or even extremely talented, individual can do.
In this case you are faced with a proof that most members of the field would be extremely proud of achieving, and for most it would even be their crowning achievement. But here you are, downplaying and dismissing the feat. Perhaps you lost perspective of what science is, and how it boils down to two simple things: gather objective observations, and draw verifiable conclusions from them. This means all science does is remix ideas. Old ideas, new ideas, it doesn't really matter. That's what scientists do. So why do people win a prize when they do it, but when a computer does the same its role is downplayed as a glorified card shuffler?
I guess that's one way to tell us apart from AIs.
Standard problem × 5 + standard solutions + standard techniques for decomposing hard problems = new hard problem solved
There is so much left in the world that hasn’t had anyone apply this approach, purely because no research programme has decided that it’s worth their attention.
If you want to shift the bar for “original” beyond problems that can be abstracted into other problems then you’re expecting AI to do more than human researchers do.
We are missing the value function that allowed AlphaGo to go from mid range player trained on human moves to superhuman by playing itself. As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.
When doing math you only ever care about the proof, not the answer itself.
If your proof is machine checkable, that's even easier.
Let it write a black box no human understands. Give the means of production away.
That's literally the thing they suggested to move away from. That is just an issue when using tools designed for us.
Make them write in formal verification languages and we only have to understand the types.
To be clear, I don't think this is a good idea, at least not yet, but we do not have to always understand the code.
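The "we only have to understand the types" idea can be made concrete with a toy Lean 4 snippet (an illustrative example, not from the thread): the theorem statement is the human-readable contract, while the proof body could be arbitrary machine output, because the kernel re-checks it either way.

```lean
-- The statement (the type) is what a human must read and agree with:
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  -- The proof term below could be ugly machine-generated output;
  -- the Lean kernel verifies it regardless of who wrote it.
  Nat.add_comm a b
```

That's the sense in which a black-box proof can still be trusted: trust moves from the author of the proof to the checker and to the statement itself.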
The bitter lesson is that the best languages / tools are the ones for which the most quality training data exists, and that's pretty much necessarily the same languages / tools most commonly used by humans.
> Correct code not nice looking code
"Nice looking" is subjective, but simple, clear, readable code is just as important as ever for projects to be long-term successful. Arguably even more so. The aphorism about code being read much more often than it's written applies to LLMs "reading" code as well. They can go over the complexity cliff very fast. Just look at OpenClaw.
We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?
Or at best "I don't know, but maybe I can find out" and proceed to finding out. But he is unlikely to shout "6" because he heard this number once when someone talked about light.
Seems that you never worked with Accenture consultants?
Whether or not selling access to massive frontier models is a viable business model, or trillion-dollar valuations for AI companies can be justified... These questions are of a completely different scale, with near-term implications for the global economy.
Including code quality. Not because they are exceptionally good (you are right that they aren't superhuman like AlphaGo) but because most humans are not that good at it anyway, and also somehow "hallucinate" because of tiredness.
Even today’s models are far from being exploited at their full potential because we actually developed pretty much no tools around it except tooling to generate code.
I'm also a long-time "doubter", but as a curious person I used the tool anyway, with all its flaws, over the last 3 years. And I'm forced to admit that hallucinations are pretty rare nowadays. Errors still happen, but they are very rare and it's easier than ever to get it back on track.
I think I'm also a "believer" now, and believe me, I really don't want to be, because as much as I'm excited by this, I'm also pretty frightened of all the bad things this tech could do to the world in the wrong hands, and I don't feel like it's particularly in the right hands.
I think they have a good optimization target with SWE-Bench-CI.
You are tested on continuous changes to a repository, spanning multiple years in the original repository. Cumulative edits need to be kept maintainable and composable.
If there is something missing from the definition of "can be maintained for multiple years incorporating bugfixes and feature additions" for code quality, then more work is needed, but I think it's a good starting point.
sigh
I'm not saying, "I used to be an atheist, but then I realized that doesn't explain anything! So glad I'm not as dumb now!"
If your definition of AI requires these things, I think -- despite the extreme fuzziness of all these terms -- that it's closer to what most people consider AGI, or maybe even ASI.
However I'm just very interested in innovation and pushing the boundaries as a more powerful force for change. One project I've been super interested in for a while is the Mill CPU architecture. While they haven't (yet) made a real chip to buy, the ideas they have are just super awesome and innovative in a lot of areas involving instruction density & decoding, pipelining, and trying to make CPU cores take 10% of the power. I hope the Mill project comes to fruition, and I hope other people build on it, and I hope that at some point AI could be a tool that prints out innovative ideas that took the Mill folks years to come up with.
there's no math answer to whether a piece of land in your neighborhood should be apartments, a parking lot or a homeless shelter; whether home prices should go up or down; how much to pay for a new life saving treatment for a child; how much your country should compel fossil fuel emissions even when another country does not... okay, AI isn't going to change anything here, and i've just touched on a bunch of things that can and will affect you personally.
math isn't the right answer to everything, not even most questions. every time someone categorizes "problems" as "hard" and "easy" and talks about "problem solving," they are being co-opted into political apathy. it's cringe for a reason.
there are hardly any mathematicians who get elected, and it's not because voters are stupid! but math is a great way to make money in America, which is why we are talking about it and not because it solves problems.
if you are seeking a simple reason why so many of the "believers" seem to lack integrity, it is because the idea that math is the best solution to everything is an intellectually bankrupt, kind of stupid idea.
if you believe that math is the most dangerous thing because it is the best way to solve problems, you are liable to say something really stupid like this:
> Imagine, say, [a country of] 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist... this is a dangerous situation... Humanity needs to wake up
https://www.darioamodei.com/essay/the-adolescence-of-technol...
Dario Amodei has never won an election. What does he know about countries? (nothing). do you want him running anything? (no). or waking up humanity? In contrast, Barack Obama, who has won elections, thinks education is the best path to less violence and more prosperity.
What are you a believer in? ChatGPT has disrupted exactly ONE business: Chegg, because its main use case is cheating on homework. AI, today, only threatens one thing: education. Doesn't bode well for us.
When I wrote that I hope we use it for good things, I was just putting a hopeful thought out there, not necessarily trying to make realistic predictions. It's more than likely people will do bad things with AI. But it's actually not set in stone yet, it's not guaranteed that it has to go one way. I'm hopeful it works out.
The point is that from now on, there will be nothing really new, nothing really original, nothing really exciting. Just an endless stream of rehashed old stuff that is just okayish.
Like an AI Spotify playlist, it will keep you in chains (aka engaged) without actually making you really happy or good. It would be like living in a virtual world, but without anything nice about living in such a world...
We have given up everything nice that human beings used to make and give to each other, and to make it worse, we have also multiplied everything bad that human beings used to give each other...
Is it because the AI is trained on existing data? But we are also trained on existing data. Do you think there's something that makes the human brain special (other than the hundreds of thousands of years of evolution, but that's what AI is trying to emulate)?
This may sound hostile (sorry for my lower-than-average writing skills), but trust me, I'm really trying to understand.
How is this the conclusion? Isn't this post about AI solving something new? What am I missing?
There’s also a discussion to be had about maths not being intrinsically creative if AI automatons can “solve” parts of it. It pains me to write that down, because I had really thought that wasn’t the case; I genuinely thought that deep down there was still something ethereal about maths. But I’ll leave that discussion for some other time.
LLMs might produce something new once in a long while through blind luck, but if they can generate something that pushes the right buttons (i.e. not really creative) for the majority of the population, then that is what we will keep getting...
I don't think I have to elaborate on the "multiplying the bad" part, as it is pretty well acknowledged...
>without actually making you like really happy or good.
What are you basing this on? I've shared several AI songs with people in real life because of how much I've enjoyed them. I don't see why an AI playlist couldn't be good or make people happy. It just needs to find what you like in music. Again, it comes back to explore vs. exploit.
Jokes. LLMs are not able to make me laugh all day by generating an infinite stream of hilarious original jokes...
Does it work for you?
Source?
I wish I had your optimism. I'm not an AI doubter (I can see that it works all by myself, so I don't think I need such verification). But I do doubt humanity's ability to use these tools for good. The potential for power and wealth concentration is off the scale compared to most of our other inventions so far.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Uh, because up until and including now, we are...?
There are also a tremendous number of similarities between all living things and between rocks (and between rocks and living things).
Most ways in which things are unique are arguably uninteresting.
The default mode, the null hypothesis should be to assume that human intelligence isn't interestingly unique unless it can be proven otherwise.
In these repeated discussions around AI, there is criticism over the way an AI solves a problem, without any actual critical thought about the way humans solve problems.
The latter is left up to the assumption that "of course humans do X differently" and if you press you invariably end up at something couched in a vague mysticism about our inner-workings.
Humans apparently create something from nothing, without the recombination of any prior knowledge or outside information, and they get it right on the first try. Through what, divine inspiration from the God who made us and only us in His image?
But you claimed that humans aren't unique. I think it's pretty obvious we are on many dimensions including what you might classify as "intelligence". You don't even necessarily have to believe in a "soul" or something like that, although many people do. The capabilities of a human far surpass every single AI to date, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
> There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Do you ever wonder why that is? I often wonder why tech has so many reductionist, materialist, and, quite frankly, anti-human thinkers.
I think it comes from a position of arrogance/ego. I'll speak for the US here, since that's what I know best; but the average 'techie' skews toward the higher end of the intelligence distribution. This is a very, very broad stroke, and that's intentional to illustrate my point. Because of this, techie culture has gained quite a bit of arrogance with regard to the masses. And this has been trained into tech culture since childhood. Whether it be adults praising us for being "so smart", or that we "figured out the VCR", or some other random tech problem that almost any human being could solve by simply reading the manual.
What I've found is that in the vast majority of technical problems that average people struggle with, if they just took a few minutes to read the manual they'd be able to solve a lot of it themselves. In short, I don't believe, as a very strong techie, that I'm "smarter than most", but rather that I've taken the time to dive into a subject area that most other humans feel neither the need nor the desire to.
There are objectively hard problems in tech to solve, but the people solving THOSE problems in the tech industry are few and far between. And so the tech industry as a whole has spent the last decade or two spinning in circles on increasingly complex systems to keep feeding its own ego about its own intelligence. We're now at a point where, rather than solving the puzzle, most techies are creating incrementally more complex puzzles to solve because they're bored of the puzzles in front of them. "Let me solve that puzzle by making a puzzle solver." "Okay, now let me make a puzzle solver creation tool to create puzzle solvers to solve the puzzle." And so forth. At the end of the day, you're still just solving a puzzle...
But it's this arrogance that really bothers me in the tech bro culture world. And, more importantly, at least in some tech bro circles, they have realized that their path to an exponential increase in wealth doesn't lie in creating new and novel ways to solve the same puzzles, but in touting AI as the greatest puzzle-solver-creation-tool puzzle solver known to man (and let me grift off of it for a little bit).
What does this mean? Are you saying every human could have achieved this result? Or this one? https://openai.com/index/new-result-theoretical-physics/
Because, well, you'd be wrong.
>, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
Human intelligence was brute forced. Please let's all stop pretending that those billions of years of evolution don't count and that we poofed into existence. And you can keep parroting 'simulacrum of intelligence' all you want, but that isn't going to make it any more true.
Meaning that however you (reasonably) define intelligence, if you compare humans to any AI system, humans are overwhelmingly more capable. Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence. Or else we'd be talking about how my calculator is intelligent. Of course computers can compute faster than we can; that's beside the point.
> Human intelligence was brute forced.
No, I don't mean how the intelligence evolved or was created. But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional. Evolution is not an intentional process, unless you believe in God or a creator of sorts, which is totally fair but probably not what you were intending.
But my point is that LLMs essentially arrive at answers by brute force through search. Go look at what a reasoning model does to count the letters in a sentence, or at the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on about 20% of a light bulb's power!).
Really? Every human? Are you sure? Because I certainly wouldn't ask just any human for the things I use these models for, and I use them for a lot of things. So, to me, the idea that all humans are 'overwhelmingly more capable' is blatantly false.
>Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence.
What was achieved here or in the link I sent is not just "solving a math equation".
>Or else we'd be talking about how my calculator is intelligent.
If you said that humans are overwhelmingly more capable than calculators in arithmetic, well I'd tell you you were talking nonsense.
>Of course computers can compute faster than we can, that's aside the point.
I never said anything about speed. You are not making any significant point here lol
>No, I don't mean how the intelligence evolved or was created.
Well then, what are you saying? Because the only brute-forced aspect of LLM intelligence is its creation. If you do not mean that, then just drop the point.
>But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional.
First of all, this makes no sense, sorry. Evolution is regularly described as a brute-force process by atheist and religious scientists alike.
Second, I don't have any problem with people thinking we have a creator, although even that still doesn't necessarily mean a magic 'poof into existence' either.
>But my point is that LLM's essentially arrive at answers by brute force through search.
Sorry but that's just not remotely true. This is so untrue I honestly don't know what to tell you. This very post, with the transcript available is an example of how untrue it is.
>or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on about 20% of a light bulb's power!).
Meaningless comparison. You are looking at two completely different substrates. Do you realize how much compute it would take to run a full simulation of the human brain on a computer? The most powerful supercomputer on the planet could not run this in real time.
Here might be some definitions of intelligence for example:
> The aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment.
> "...the resultant of the process of acquiring, storing in memory, retrieving, combining, comparing, and using in new contexts information and conceptual skills".
> Goal-directed adaptive behavior.
> a system's ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation
But even a housefly possesses levels of intelligence regarding flight and spatial awareness that dominate any LLM. Would it be fair to say a fly is more intelligent than an LLM? It certainly is along a narrow set of axes.
> Because the only brute-forced aspect of LLM intelligence is its creation.
I would consider statistical reasoning systems that can simulate aspects of human thought to be a form of brute force. Not quite an exhaustive search, but massively compressed experience + pattern matching.
But regardless, even if both forms of intelligence arrived via some form of brute force, what is more important to me is the result of that - how does the process of employing our intelligence look.
> This very post, with the transcript available is an example of how untrue it is.
The transcript lacks the vector embeddings of the model's reasoning. It's literally just a summary from the model - not even that really.
> Do you realize how much compute it would take to run a full simulation of the human brain on a computer? The most powerful supercomputer on the planet could not run this in real time.
You're so close to getting it lol
So all humans are overwhelmingly more intelligent, but cannot even manage to be as capable in a significant number of domains? That's not what overwhelming means.
>I would consider statistical reasoning systems that can simulate aspects of human thought to be a form of brute force.
That is not really what “brute force” means. Pattern learning over a compressed representation of experience is not the same thing as exhaustive search. Calling any statistical method “brute force” just makes the term too vague to be useful.
> what is more important to me is the result of that - how does the process of employing our intelligence look.
But this is exactly where you are smuggling in assumptions. We do not actually understand the internal workings of either the human brain or frontier LLMs at the level needed to make confident claims like this. So a lot of what you are calling “the result” is really just your intuition about what intelligence is supposed to look like.
And I do not think that distinction is as meaningful as you want it to be anyway. Flight is flight. Birds fly and planes fly. A plane is not a “simulacrum of flight” just because it achieves the same end by a different mechanism.
>The transcript lacks the vector embeddings of the model's reasoning. It's literally just a summary from the model - not even that really.
You do not need access to every internal representation to see that the model did not arrive at the answer by brute-forcing all possibilities. The observed behavior is already enough to rule that out.
> Do you realize how much compute it would take to run a full simulation of the human brain on a computer? The most powerful supercomputer on the planet could not run this in real time.
>You're so close to getting it lol.
No, you don't understand what I'm saying. If we were to be more accurate to the brain in silicon, it would be even less efficient than LLMs, never mind humans. Does that mean the way the brain works is wrong? No, it means we are dealing with two entirely different substrates, and directly comparing efficiencies like that to show one is superior is silly.
When the number of domains in which humans are more capable than LLMs vastly exceeds the number of domains in which LLMs are more capable than humans, yes.
I also agree that we don't have a great understanding of either human or LLM intelligence, but we can at least observe major differences and conclude that there are, in fact, major differences. In the same way we can conclude that both birds and planes have major differences, and saying that "there's nothing unique about birds, look at planes" is just a really weird thing to say.
> If we were to be more accurate to the brain in silicon, it would be even less efficient than LLMs
Do you think perhaps this massive difference points to there being a significant and foundational structural and functional difference between these types of intelligences?
Yes, in many ways absolutely. Just because a model is a better "Google" than my dummy friend doesn't mean that this same friend isn't still more capable in countless cases.
> Meaningless comparison. You are looking at two completely different substrates. Do you realize how much compute it would take to run a full simulation of the human brain on a computer? The most powerful supercomputer on the planet could not run this in real time.
Isn't that just more proof of how efficient the human brain is? Especially given that a wire has much better properties than water solutions in bags.
Some example goals which make humans trivially superior (in terms of intelligence): the invention of nuclear bombs/power plants, the theory of relativity, etc.
Perhaps this might better help you understand why this assumption still holds: https://en.wikipedia.org/wiki/Orchestrated_objective_reducti...
It likely requires rejection of functionalism, or the acceptance that quantum states are required for certain functions. Both of those are heavy commitments with the latter implying that there are either functions that require structures that can't be instantiated without quantum effects or functions that can't be emulated without quantum effects, both of which seem extremely unlikely to me.
Probably for the far more important reason that it doesn't solve any problem. It's just "quantum woo, therefore libertarian free will" most of the time.
It's mostly garbage, maybe a tiny tiny bit of interesting stuff in there.
It also would do nothing to indicate that human intelligence is unique.
This is to say nothing of the cost of this small but remarkable advance. Trillions of dollars in training and inference, and so far we have a couple of minor (trivial?) math solutions. I'm sure if someone had bothered funding a few PhDs for a year we could have found this without AI.
Replace AI with human here and that's... just how collaborative research works lol.
I take it you're not a mathematician. This is an achievement, regardless of whether you like LLMs or not, so let's not belittle the people working on these kinds of problems please.
This is one of the most baffling and ironic aspects of these discussions. Human exceptionalism is what drives these arguments, but the machines are becoming so good that you can no longer do this without putting down even the top-percenter humans in the process. The same thing is happening all over this thread (https://news.ycombinator.com/item?id=47006594). And it's like they don't even realize it.
Because, empirically, we have numerous unique and differentiable qualities, obviously. Plenty of time goes into understanding this, we have a young but rigorous field of neuroscience and cognitive science.
Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
I frankly doubt it applies to either system.
I'm a functionalist, so I obviously believe that everything a human brain does is physical and could be replicated using some other material that can exhibit the necessary functions. But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/agent is doing what humans do.
This particular problem was about improving the lower bound for a function tracking a property of hypergraphs (undirected graphs where edges can contain more than two vertices).
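For readers unfamiliar with the object, a hypergraph can be sketched in a few lines; the vertices and edges below are made up for illustration, not taken from the paper under discussion:

```python
# A hypergraph is a vertex set plus a set of edges, where each edge may
# contain any number of vertices, not just two (illustrative example).
vertices = {1, 2, 3, 4, 5}
edges = {
    frozenset({1, 2}),        # an ordinary graph edge
    frozenset({2, 3, 4}),     # a 3-vertex hyperedge
    frozenset({1, 3, 4, 5}),  # a 4-vertex hyperedge
}

# Every edge must be a subset of the vertex set.
assert all(e <= vertices for e in edges)

# Ordinary graphs are the special case where every edge has exactly 2 vertices.
is_ordinary_graph = all(len(e) == 2 for e in edges)
print(is_ordinary_graph)  # False: two of the edges are larger
```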
Both constructing hypergraphs (sets) and lower bounds are very regular, chore-type tasks that are common in maths. In other words, there's plenty of this type of proof in the training data.
LLMs kind of construct proofs all the time, every time they write a program, because every program has a corresponding proof. It doesn't mean they're reasoning about them, but they do construct proofs.
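That claim is the Curry-Howard correspondence: under it, a total function's type reads as a proposition, and the function body is a proof of it. A hedged sketch (the function names are invented for illustration):

```python
# Under Curry-Howard, the type (A -> B) -> (B -> C) -> (A -> C) reads as
# the proposition "if A implies B and B implies C, then A implies C",
# and any total function with that type is a proof of it.
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")

def compose(f: Callable[[A], B], g: Callable[[B], C]) -> Callable[[A], C]:
    # The body is the proof: given evidence for A, produce evidence for C.
    return lambda a: g(f(a))

# Any ordinary program exercises the same correspondence implicitly.
double_then_show = compose(lambda n: n * 2, str)
print(double_then_show(21))  # "42"
```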
This isn't science fiction. But it's nice that the LLMs solved something for once.
That sentence alone needs unpacking IMHO: no LLM suddenly decided that today was the day it would solve a math problem. Instead, a couple of people who love mathematics, doing it either for fun or professionally, directly asked a model to solve a very specific task that they estimated was solvable. The LLM itself was fed countless related proofs. They then guided the model and verified its output until they found something they considered good enough.
My point is that the system itself is not the LLM alone, as that would be radically more impressive.
(Edit: Yes, I'm aware a lot of people care about FP, "Clean Code", etc., but these are all red herrings that don't actually have anything to do with quality. At best they are guidelines for less experienced programmers and at worst a massive waste of time if you use more than one or two suggestions from their collection of ideas.)
Most of the industry couldn't use objective metrics for code quality and the quality of the artifacts they produce without also abandoning their entire software stack because of the results. They're using the only metric they've ever cared about: time-to-ship. The results are just a sped-up version of what we've had for more than two decades now: software getting slower, buggier, and less usable.
If you don't have a good regulating function for what represents real quality, you can't really expect systems that just pump out code to iterate very well on anything. There are very few forcing functions to use to produce high-quality results through iteration.
We need a bigger version of the METR study on perceived vs. real productivity[0], I guess. It's a thankless job, though, since people will assume/state even at publication time that "Everything has progressed so much, those models and agents sucked, everything is 10 times better now!" and you basically have to start a new study, repeat ad infinitum.
One problem that really complicates things is that the net competency of these models seems really spotty and uneven. They're apparently out here solving math problems that seemingly "require thinking", but at the same time will write OpenGL code that produces black screens on basically every driver, doesn't produce the intended results, and leads to hours of debugging time for someone not familiar enough with it. That's despite OpenGL code being far more prevalent out there than math proofs, presumably. How do you reliably even theorize about things like this when something can be so bad and (apparently) so good at the same time?
0 - https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
edit: I see in the full write-up that the contributor estimates an expert would take 1-3 months to do this. They also note that they had come up with this solution independently but hadn't confirmed it.
>The newly-solved problem came from Will Brian, who had placed it in the Moderately Interesting category. It is a conjecture from a paper he wrote with Paul Larson in 2019. They were unable to solve it at the time, or in several attempts since. Brian had this to say.
1. It's labeled as "moderately interesting"
2. They said that they expect an expert could solve it in 1-3 months
3. They had already come up with the solution that the AI had but weren't convinced it would have worked
So how big was the gap here, do you think?
With brute force, or slightly better than brute force, it's most likely the first; thus not totally pointless, but probably not very useful. In fact, it might not even be worth the tokens spent.
Super cool, of course.
I wonder how much of this meteoric progress in actually creating novel mathematics is because the training data is of a much higher standard than code, for example.
> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
> it's not because the model is figuring out something new
> LLMs will NEVER be able to do that, because it doesn't exist
It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.
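For what it's worth, the multiplication used as a counterexample checks out, as a one-liner confirms:

```python
# Verifying the arithmetic counterexample: a product of two 6-digit
# numbers that is extremely unlikely to appear verbatim in training data.
product = 167_383 * 426_397
print(product)  # 71371609051
assert product == 71_371_609_051
```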
If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?