I've "solved" many math problems with LLMs, with LLMs giving full confidence in subtly or significantly incorrect solutions.
I'm very curious here. The OpenAI memory orders and the claims about capacity limits restricting access to better models are interesting too.
"Very nice! ... actually the thing that impresses me more than the proof method is the avoidance of errors, such as making mistakes with interchanges of limits or quantifiers (which is the main pitfall to avoid here). Previous generations of LLMs would almost certainly have fumbled these delicate issues.
...
I am going ahead and placing this result on the wiki as a Section 1 result (perhaps the most unambiguous instance of such, to date)"
The pace of change in math is going to be something to watch closely. Many minor theorems will fall. Next major milestone: Can LLMs generate useful abstractions?
"On following the references, it seems that the result in fact follows (after applying Rogers' theorem) from a 1936 paper of Davenport and Erdos (!), which proves the second result you mention. ... In the meantime, I am moving this problem to Section 2 on the wiki (though the new proof is still rather different from the literature proof)."
We are very close.
(by the way, I like writing code and I still do it for fun)
Isn't that a perfectly reasonable metric? The topic has been dominated by hype for at least the past 5 if not 10 years. So when you encounter the latest in a long line of "the future is here the sky is falling" claims, where every past claim to date has been wrong, it's natural to try for yourself, observe a poor result, and report back "nope, just more BS as usual".
If the hyped future does ever arrive then anyone trying for themselves will get a workable result. It will be trivially easy to demonstrate that naysayers are full of shit. That does not currently appear to be the case.
If I release a claim once a month that armageddon will happen next month, and then after 20 years it finally does, are all of my past claims vindicated? Or was I spewing nonsense the entire time? What if my claim was the next big pandemic? The next 9.0 earthquake?
What you are doing, however, is dismissing the outrageous progress in NLP, and by extension code generation, of the last few years just because people overhype it.
People overhyped the Internet in the early 2000s, yet here we are.
I never dismissed the actual verifiable progress that has occurred. I objected specifically to the hype. Are you sure you're arguing with what I actually said as opposed to some position that you've imagined that I hold?
> People overhyped the Internet in the early 2000s, yet here we are.
And? Did you not read the comment you are replying to? If I make wild predictions and they eventually pan out does that vindicate me? Or was I just spewing nonsense and things happened to work out?
"LLMs will replace developers any day now" is such a claim. If it happens a month from now then you can say you were correct. If it doesn't then it was just hype and everyone forgets about it. Rinse and repeat once every few months and you have the current situation.
When someone says something to the effect of "LLMs are on the verge of replacing developers any day now" it is perfectly reasonable to respond "I tried it and it came up with crap". If we were actually near that point you wouldn't have gotten crap back when you tried it for yourself.
People who use this stuff every day know that people who are still saying "I tried it and it produced crap" just don't know how to use it correctly. Those developers WILL get replaced - by ones who know how to use the tool.
Now _that_ I would believe. But note how different "those who fail to adapt to this new tool will be replaced" is from "the vast majority will be replaced by this tool itself".
If someone had said that six (give or take) months ago I would have dismissed it as hype. But there have been at least a few decently well documented AI assisted projects done by veteran developers that have made the front page recently. Importantly they've shown clear and undeniable results as opposed to handwaving and empty aspirations. They've also been up front about the shortcomings of the new tool.
The ability to make money proves you found a good market, it doesn't prove that the new tools are useful to others.
They’re a probabilistic phonograph. They can sharpen the funnel for input but they can’t provide judgement on input or resolve ambiguities in your specifications. Teams of human requirements engineers cannot do it. LLMs are not magic. You’re essentially asking it: from my wardrobe, pick an outfit for me and make sure it’s the one I would have picked.
If you’re dazzled into thinking LLMs can solve this you just don’t understand transformer architecture and you don’t understand requirements engineering.
You’ll know a proper AI engine when you see it and it doesn’t look like an LLM.
Coding was never the hard part of software development.
I’m talking about the inference step, which uses tensor geometry arithmetic to find patterns in text. We don’t understand what those patterns are, but it’s clear it’s doing some heavy lifting, since LLM inference is expressing logic and reasoning under the guise of our reductive “next token prediction”.
I'm beginning to think the Bitter Lesson applies to organic intelligence as well, because basic pattern matching can be implemented relatively simply using very basic mathematical operations like multiply and accumulate, and so it can scale with massive parallelization of relatively simple building blocks.
The ability to think about your own thinking over and over as deeply as needed is where all the magic happens. Counterfactual reasoning occurs every time you pop a mental stack frame. By augmenting our stack with external tools (paper, computers, etc.), we can extend this process as far as it needs to go.
LLMs start to look a lot more capable when you put them into recursive loops with feedback from the environment. A trillion tokens worth of "what if..." can be expended without touching a single token in the caller's context. This can happen at every level as many times as needed if we're using proper recursive machinery. The theoretical scaling around this is extremely favorable.
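A toy sketch of that recursive pattern, under stated assumptions: `propose` and `score` here are hypothetical stand-ins (a model call and environment feedback in practice), and the puzzle is deliberately trivial. The point is structural: every "what if" branch is expanded inside its own call, and only the best outcome is returned to the caller.

```python
def propose(partial):
    """Hypothetical stand-in for "what if" candidate generation
    (an LLM call in practice); here we just branch over a small alphabet."""
    return [partial + ch for ch in "abcdef"]

def score(candidate, target):
    """Stand-in for environment feedback: count positions that already match."""
    return sum(1 for a, b in zip(candidate, target) if a == b)

def explore(partial, target, depth):
    """Each call is its own stack frame: all counterfactual branches are
    expanded down here, and only the single best outcome propagates back,
    so the caller never sees the discarded branches."""
    if depth == 0 or partial == target:
        return partial
    outcomes = [explore(c, target, depth - 1) for c in propose(partial)]
    return max(outcomes, key=lambda o: score(o, target))

print(explore("", "cafe", depth=4))  # -> cafe
```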
E.g. you want to find a subimage in a big image, possibly rotated, scaled, tilted, distorted, with noise. You cannot do that with an exact pattern matcher, but you can do it with a fuzzy matcher, such as an LLM.
You want to evaluate a go position on a go board. An LLM is perfect for that, because you don't need to come up with a special language to describe go positions (older chess programs did that); you just train the model to judge whether a position is good or bad, and this can be fully automated via existing literature and later by playing against itself. You train the matcher not via patterns but via a function (win or lose).
But I think the trend line unmistakably points to a future where it can be MORE intelligent than a human in exactly the colloquial way we define "more intelligent"
The fact that one of the greatest mathematicians alive maintains a page and is seriously benchmarking this shows how likely he believes this is to happen.
The answer is yes. Assume, for the sake of contradiction, that there exists an \(\epsilon > 0\) such that for every \(k\), there exists a choice of congruence classes \(a_1^{(k)}, \dots, a_k^{(k)}\) for which the set of integers not covered by the first \(k\) congruences has density at least \(\epsilon\).
For each \(k\), let \(F_k\) be the set of all infinite sequences of residues \((a_i)_{i=1}^\infty\) such that the uncovered set from the first \(k\) congruences has density at least \(\epsilon\). Each \(F_k\) is nonempty (by assumption) and closed in the product topology (since it depends only on the first \(k\) coordinates). Moreover, \(F_{k+1} \subseteq F_k\) because adding a congruence can only reduce the uncovered set. By the compactness of the product of finite sets, \(\bigcap_{k \ge 1} F_k\) is nonempty.
Choose an infinite sequence \((a_i) \in \bigcap_{k \ge 1} F_k\). For this sequence, let \(U_k\) be the set of integers not covered by the first \(k\) congruences, and let \(d_k\) be the density of \(U_k\). Then \(d_k \ge \epsilon\) for all \(k\). Since \(U_{k+1} \subseteq U_k\), the sets \(U_k\) are decreasing and periodic, and their intersection \(U = \bigcap_{k \ge 1} U_k\) has density \(d = \lim_{k \to \infty} d_k \ge \epsilon\). However, by hypothesis, for any choice of residues, the uncovered set has density \(0\), a contradiction.
Therefore, for every \(\epsilon > 0\), there exists a \(k\) such that for every choice of congruence classes \(a_i\), the density of integers not covered by the first \(k\) congruences is less than \(\epsilon\).
\boxed{\text{Yes}}
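To make the objects in that argument concrete, here is a minimal sketch that computes the density of the uncovered set exactly over one period. The moduli and residues below are made up purely for illustration and are not the actual setup of problem #281.

```python
from math import lcm

def uncovered_density(moduli, residues):
    """Density of integers not covered by any congruence x ≡ a_i (mod n_i),
    computed exactly over one full period L = lcm(n_1, ..., n_k)."""
    L = lcm(*moduli)
    uncovered = sum(
        1 for x in range(L)
        if all(x % n != a % n for n, a in zip(moduli, residues))
    )
    return uncovered / L

# Illustrative moduli only (not the moduli of the Erdos problem):
moduli = [2, 3, 4, 6, 12]
print(uncovered_density(moduli, [0, 0, 1, 5, 7]))   # this choice of residues covers everything: 0.0
print(uncovered_density(moduli, [1, 2, 3, 4, 11]))  # this choice leaves density 1/6 uncovered
```

The d_k in the proof is exactly this quantity, computed for the first k congruences of whatever modulus sequence is under discussion.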
On the contrary, for DeepSeek you could, but not for a non-open model.
It says that the OpenAI proof is a different one from the published one in the literature.
Whereas whether the DeepSeek proof is the same as the published one, I don't know enough of the math to judge.
That was what I meant.
You could have just rubber-stamped it yourself, for all the mathematical rigor it holds. The devil is in the details, and the smallest problem unravels the whole proof.
Is this enough? Let $U_k$ be the set of integers whose remainder mod $6^n$ is greater than or equal to $2^n$ for all $1 < n < k$. The density of each $U_k$ is more than $1/2$ I think, but not of the intersection (which is empty), right?
This would all be a fairly trivial exercise in diagonalization if such a lemma as implied by DeepSeek existed.
(Edit: The bounding I suggested may not be precise at each level, but it is asymptotically the limit of the sequence of densities, so up to some epsilon it demonstrates the desired counterexample.)
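For concreteness, a quick numeric check of that construction, computed exactly over one period of the largest modulus involved (a sketch only, using the moduli and thresholds as I described them above):

```python
def density_Uk(k):
    """Exact density of U_k = {m : m mod 6^n >= 2^n for all 1 < n < k},
    computed over one period 6^(k-1) of the largest modulus involved."""
    period = 6 ** (k - 1)
    good = sum(
        1 for m in range(period)
        if all(m % 6 ** n >= 2 ** n for n in range(2, k))
    )
    return good / period

for k in range(3, 8):
    print(k, density_Uk(k))
# The densities stay above 5/6 (so certainly above 1/2), yet every fixed m
# eventually fails the condition once 2^n exceeds m (and 6^n > m), so the
# infinite intersection is empty.
```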
I believe the ones that are NOT studied are unstudied precisely because they are seen as uninteresting. Even if they were to be solved in an interesting way, if nobody sees the proof because there are just too many and they are again not considered valuable, then I don't see what is gained.
More broadly, I think there’s a perspective that literally just building out thousands more true statements in Lean is going to keep cementing math’s broadening knowledge framework. This is not building a giant castle à la Wiles, it’s laying bricks in the outhouse, but someday those bricks might be useful.
EDIT: After reading a link someone else posted to Terence Tao's wiki page, he has a paragraph that somewhat answers this question:
> Erdős problems vary widely in difficulty (by several orders of magnitude), with a core of very interesting, but extremely difficult problems at one end of the spectrum, and a "long tail" of under-explored problems at the other, many of which are "low hanging fruit" that are very suitable for being attacked by current AI tools. Unfortunately, it is hard to tell in advance which category a given problem falls into, short of an expert literature review. (However, if an Erdős problem is only stated once in the literature, and there is scant record of any followup work on the problem, this suggests that the problem may be of the second category.)
from here: https://github.com/teorth/erdosproblems/wiki/AI-contribution...
The problems are a pretty good metric for AI, because the easiest ones at least meet the bar of "a top mathematician didn't know how to solve this off the top of his head" and the hardest ones are major open problems. As AI progresses, we will see it slowly climb the difficulty ladder.
This is bad faith. Erdos was an incredibly prolific mathematician; it is unreasonable to expect anyone to have memorized his entire output. Yet Tao knows enough about Erdos to know which mathematical techniques he regularly used in his proofs.
From the forum thread about Erdos problem 281:
> I think neither the Birkhoff ergodic theorem nor the Hardy-Littlewood maximal inequality (some version of either was the key ingredient to unlock the problem) were in the regular toolkit of Erdos and Graham (I'm sure they were aware of these tools, but would not instinctively reach for them for this sort of problem). On the other hand, the aggregate machinery of covering congruences looks relevant (even though ultimately it turns out not to be), and was very much in the toolbox of these mathematicians, so they could have been misled into thinking this problem was more difficult than it actually was due to a mismatch of tools.
> I would assess this problem as safely within reach of a competent combinatorial ergodic theorist, though with some thought required to figure out exactly how to transfer the problem to an ergodic theory setting. But it seems the people who looked at this problem were primarily expert in probabilistic combinatorics and covering congruences, which turn out to not quite be the right qualifications to attack this problem.
It does raise the question: if it was so easy to find the prior solution, why had no one already posted it on the erdosproblems website?
Somehow an LLM-generated proof that consists of gigabytes upon gigabytes of unreadable mess is groundbreaking and pushes mathematics forward, while a proof proposed by Erdos himself in 5 pages gets buried and lost to time.
Maybe one particular optics fuels the narrative that formally verified compute is the new moat and LLMs are amazing at that?
and there is an ongoing literature review (which has been lucrative to both erdosproblems and the OEIS), and this one was relabelled upon the discovery of an earlier resolution
This is no longer true: a prior solution has just been found[1], so the LLM proof has been moved to Section 2 of Terence Tao's wiki[2].
[1] - https://www.erdosproblems.com/forum/thread/281#post-3325
[2] - https://github.com/teorth/erdosproblems/wiki/AI-contribution...
This aligns nicely with the rest of the canon. LLMs are just stochastic parrots. Fancy autocomplete. A glorified Google search with worse footnotes. Any time they appear to do something novel, the correct explanation is that someone, somewhere, already did it, and the model merely vibes in that general direction. The fact that no human knew about it at the time is a coincidence best ignored.
The same logic applies to code. “Vibe coding” isn’t real programming. Real programming involves intuition, battle scars, and a sixth sense for bugs that can’t be articulated but somehow always validates whatever I already believe. When an LLM produces correct code, that’s not engineering, it’s cosplay. It didn’t understand the problem, because understanding is defined as something only humans possess, especially after the fact.
Naturally, only senior developers truly code. Juniors shuffle syntax. Seniors channel wisdom. Architecture decisions emerge from lived experience, not from reading millions of examples and compressing patterns into a model. If an LLM produces the same decisions, it’s obviously cargo-culting seniority without having earned the right to say “this feels wrong” in a code review.
Any success is easy to dismiss. Data leakage. Prompt hacking. Cherry-picking. Hidden humans in the loop. And if none of those apply, then it “won’t work on a real codebase,” where “real” is defined as the one place the model hasn’t touched yet. This definition will be updated as needed.
Hallucinations still settle everything. One wrong answer means the whole system is fundamentally broken. Human mistakes, meanwhile, are just learning moments, context switches, or coffee shortages. This is not a double standard. It’s experience.
Jobs are obviously safe too. Software engineering is mostly communication, domain expertise, and navigating ambiguity. If the model starts doing those things, that still doesn’t count, because it doesn’t sit in meetings, complain about product managers, or feel existential dread during sprint planning.
So yes, the Erdos situation is resolved. Nothing new happened. No reasoning occurred. Progress remains hype. The trendline is imaginary. And any discomfort you feel is probably just social media, not the ground shifting under your feet.
(edit: fixed link)
I'm pretty sure it's like "can it run DOOM", and someone could make an LLM that passes this that runs on a pregnancy test.
It wasn't AI generated. But if it was, there is currently no way for anyone to tell the difference.
This is false. There are many human-legible signs, and there do exist fairly reliable AI detection services (like Pangram).
Negative feedback is the original "all you need."
There have already been several scandals where students were accused of AI use on the basis of these services and successfully fought back.
You're lying: https://www.pangram.com/history/94678f26-4898-496f-9559-8c4c...
Not that I needed pangram to tell me that, it's obvious slop.
I guess this is the end of the human internet
"Glorified Google search with worse footnotes" what on earth does that mean?
AI has a distinct feel to it
For better or worse, I think we might have to settle on “human-written until proven otherwise”, if we don’t want to throw “assume positive intent” out the window entirely on this site.
Evidence shows otherwise: Despite the "20x" length, many people actually missed the point.
Which is also what makes it problematic that you're lying about your LLM use. I would honestly love to know your prompt and how you iterated on the post, how much you put into it and how much you edited or iterated. Although pretending there was no LLM involved at all is rather disappointing.
Unfortunately I think you might feel backed into a corner now that you've insisted otherwise but it's a genuinely interesting thing here that I wish you'd elaborate on.
Vs
> Interesting that in Terrance Tao's words: "though the new proof is still rather different from the literature proof)"
I agree brevity is always preferred. Making a good point while keeping it brief is much harder than rambling on.
But length is just a measure, quality determines if I keep reading. If a comment is too long, I won’t finish reading it. If I kept reading, it wasn’t too long.
And even odder that the proof was by Erdos himself and yet he listed it as an open problem!
It really contextualizes the old wisdom of Pythagoras that everything can be represented as numbers / math is the ultimate truth
They create concepts in latent space, which is basically compression, and compression forces this.
But I'm not a mathematics expert; if this is the real official definition, I'm fine with it. But are you, though?
Consider estimating the position of an object from noisy readings. One presumes that position to exist in some sense, and then one can estimate it by combining multiple measurements, increasing positioning resolution.
It's any variable that is postulated or known to exist, and for which you run some fitting procedure.
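A minimal sketch of that idea, with a made-up "true position" and Gaussian noise: the position itself is never observed, only noisy readings, and averaging more of them recovers it with increasing resolution.

```python
import random

random.seed(0)

TRUE_POSITION = 3.7   # the latent quantity; assumed here, never observed directly
NOISE_SIGMA = 0.5

def estimate(n_readings):
    """Average n noisy readings of the unobserved true position."""
    readings = [random.gauss(TRUE_POSITION, NOISE_SIGMA) for _ in range(n_readings)]
    return sum(readings) / len(readings)

for n in (1, 10, 100, 10_000):
    print(n, round(estimate(n), 3))
# The error of the average shrinks roughly like sigma / sqrt(n):
# that's the "increasing positioning resolution" from combining measurements.
```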
It doesn't matter if AI is in a hype cycle or not; that doesn't change how the technology works.
Check out the YouTube videos from 3Blue1Brown; he explains LLMs quite well. Your first step is the word embedding: this vector space represents the relationships between words. Father - grandfather: the vector which takes father to grandfather is the same vector as the one from mother to grandmother.
You then use these word vectors in the attention layers to create an n-dimensional space, aka latent space, which basically reflects a 'world' the LLM walks through. This makes the 'magic' of LLMs.
Basically a form of compression, with higher dimensions reflecting a kind of meaning.
Your brain does the same thing. It can't store pixels so when you go back to some childhood environment like your old room, you remember it in some efficient (brain efficient) way. Like the 'feeling' of it.
That's also the reason why an LLM is not just some statistical parrot.
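A toy illustration of that father/grandfather analogy (the 4-dimensional "embeddings" below are made up purely to show the vector arithmetic; real models learn much higher-dimensional vectors):

```python
import numpy as np

# Made-up toy embeddings, chosen only so the analogy works exactly.
emb = {
    "father":      np.array([0.9, 0.1, 0.5, 0.2]),
    "grandfather": np.array([0.9, 0.1, 0.9, 0.2]),
    "mother":      np.array([0.1, 0.9, 0.5, 0.2]),
    "grandmother": np.array([0.1, 0.9, 0.9, 0.2]),
}

# The "one generation up" direction, read off the father pair:
generation_up = emb["grandfather"] - emb["father"]

# Applying the same direction to "mother" should land near "grandmother":
candidate = emb["mother"] + generation_up

nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - candidate))
print(nearest)  # -> grandmother
```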
It does change what people say about it. Our words are not reality itself; the map is not the territory.
Are you saying people should take everything said about LLMs at face value?
It's the reason why I'm here: because we discuss technology more technically.
I spend too much time here and decided to delete my account to interact less.
It's partially working though
I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction.
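I don't know how any particular vendor implements it, but a crude sketch of such a check, flagging output that shares a long verbatim word n-gram with a reference corpus, might look like this:

```python
def has_long_overlap(output, corpus_docs, n=8):
    """Flag the output if it shares any n-word sequence verbatim
    with a document in the reference corpus (a crude proxy check)."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    out_grams = ngrams(output)
    return any(out_grams & ngrams(doc) for doc in corpus_docs)

corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
print(has_long_overlap("he saw the quick brown fox jumps over the lazy dog today", corpus))  # True
print(has_long_overlap("a completely different sentence about foxes and dogs", corpus))      # False
```

Real systems presumably work at far larger scale with smarter matching, but the principle is the same: compare generated text against known sources before returning it.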
It's great business to minimally modify valuable stuff and then take credit for it. As was explained to me by bar-certified counsel "if you take a recipe and add, remove or change just one thing, it's now your recipe"
The new trend in this is asking Claude Code to create some type of software, like a browser or a DICOM viewer, and then publishing that it managed to do this very expensive thing (but if you could check the source code, which is never published, it probably imports a lot of open source dependencies that actually do the thing).
Now this is especially useful in business, but it seems that some people are repurposing it for proving math theorems. The Terence Tao effort, which later checks for previous material, is great! But the fact that Section 2 (for such cases) is filled to the brim, while Section 1 is mostly documented failed attempts (except for 1 proof, congratulations to the authors), mostly confirms my hypothesis; claiming that the model has guards that prevent it is a deus ex machina cope against the evidence.
Legally I think it works, but evidence in a court works differently than in science. It's the same word, but don't let that confuse you, and don't mix up the two.
The infeasibility is searching for the (unknown) set of translations that the LLM would put that data through. Even if you posit only basic symbolic LUT mappings in the weights (it's not), there's no good way to enumerate them anyway. The model might as well be a learned hash function that maintains semantic identity while utterly eradicating literal symbolic equivalence.
A carbon copy would mean overfitting.
It looked a bit like someone at Google subscribed to a legal theory under which you can avoid copyright infringement if you take a derivative work and apply a mechanical obfuscation to it.
People seem to have this belief, or perhaps just general intuition, that LLMs are a google search on a training set with a fancy language engine on the front end. That's not what they are. The models (almost) self avoid copyright, because they never copy anything in the first place, hence why the model is a dense web of weight connections rather than an orderly bookshelf of copied training data.
Picture yourself contorting your hands under a spotlight to generate a shadow in the shape of a bird. The bird is not in your fingers, despite the shadow of the bird, and the shadow of your hand, looking very similar. Furthermore, your hand-shadow has no idea what a bird is.
But honestly source = "a knuckle sandwich" would be appropriate here.
Edit: you've been breaking the site guidelines badly in other threads as well. (To pick one example of many: https://news.ycombinator.com/item?id=46601932.) We've asked you many times not to.
I don't want to ban your account because your good contributions are good and I do believe you're well-intentioned. But really, can you please take the intended spirit of this site more to heart and fix this? Because at some point the damage caused by poisonous comments is worse.
https://news.ycombinator.com/showhn.html
* it would be more accurate to say "using violent language as a trope in an argument" - I don't believe in taking comments like this literally, as if they're really threatening violence. Nonetheless you can't post this way to HN.
This is a verbatim quote from Gemini 3 Pro, from a chat a couple of days ago:
"Because I have done this exact project on a hot water tank, I can tell you exactly [...]"
I somehow doubt that an LLM did that exact project, what with it not having any ability to do plumbing in real life...
A) It is still possible a proof from someone else with a similar method was in the training set.
B) Something similar to Erdos's proof for a different problem, together with an alternate solution similar to ChatGPT's, was in the training set, which would be more impressive than A).
At this point the only conclusion here is: the original proof was in the training set. The author and Terence did not care enough to find the publication by Erdos himself.
A proof that Terence Tao and his colleagues have never heard of? If he says the LLM solved the problem with a novel approach, different from what the existing literature describes, I'm certainly not able to argue with him.
Tao et al. didn't know of the literature proof that started this subthread.
> He speculated that "the formulation [of the problem] has been altered in some way"....
[snip]
> More broadly, I think what has happened is that Rogers' nice result (which, incidentally, can also be proven using the method of compressions) simply has not had the dissemination it deserves. (I for one was unaware of it until KoishiChan unearthed it.) The result appears only in the Halberstam-Roth book, without any separate published reference, and is only cited a handful of times in the literature. (Amusingly, the main purpose of Rogers' theorem in that book is to simplify the proof of another theorem of Erdos.) Filaseta, Ford, Konyagin, Pomerance, and Yu - all highly regarded experts in the field - were unaware of this result when writing their celebrated 2007 solution to #2, and only included a mention of Rogers' theorem after being alerted to it by Tenenbaum. So it is perhaps not inconceivable that even Erdos did not recall Rogers' theorem when preparing his long paper of open questions with Graham in 1980.
(emphasis mine)
I think the value of LLM guided literature searches is pretty clear!
Both are precisely true. It is a better search engine than anything else -- which, while true, is something you won't realize unless you've used the non-free 'pro research' features from Google and/or OpenAI. And it can perform limited but increasingly-capable reasoning about what it finds before presenting the results to the user.
Note that no online Web search or tool usage at all was involved in the recent IMO results. I think a lot of people missed that little detail.
https://terrytao.wordpress.com/2026/01/19/rogers-theorem-on-...
"This theorem is somewhat obscure: its only appearance in print is in pages 242-244 of this 1966 text of Halberstam and Roth, where the authors write in a footnote that the result is “unpublished; communicated to the authors by Professor Rogers”. I have only been able to find it cited in three places in the literature: in this 1996 paper of Lewis, in this 2007 paper of Filaseta, Ford, Konyagin, Pomerance, and Yu (where they credit Tenenbaum for bringing the reference to their attention), and is also briefly mentioned in this 2008 paper of Ford. As far as I can tell, the result is not available online, which could explain why it is rarely cited (and also not known to AI tools). This became relevant recently with regards to Erdös problem 281, posed by Erdös and Graham in 1980, which was solved recently by Neel Somani through an AI query by an elegant ergodic theory argument. However, shortly after this solution was located, it was discovered by KoishiChan that Rogers’ theorem reduced this problem immediately to a very old result of Davenport and Erdös from 1936. Apparently, Rogers’ theorem was so obscure that even Erdös was unaware of it when posing the problem!"
A lot of pure mathematics seems to consist in solving neat logic puzzles without any intrinsic importance. Recreational puzzles for very intelligent people. Or LLMs.
Just because we can't imagine applications today doesn't mean there won't be applications in the future which depend on discoveries that are made today.
https://www.reddit.com/r/math/comments/dfw3by/is_there_any_e...
Instead of addressing any of that you're insisting I'm misunderstanding and pointing me back to a linked comment of yours drawing a distinction between epistemic value of science research vs math research. Epistemic value counts for many things, but one thing it can't do is negate the significance of pure math turning into applied research on account of pure science doing the same.
No, "so what" doesn't indicate disagreement, just that something isn't relevant.
Anyway, assume hot dogs don't taste good at all, except in rare circumstances. It would then be wrong to say "hot dogs taste good", but it would be right to say "hot dogs don't taste good". Now substitute pure math for hot dogs. Pure math can be generally useless even if it isn't always useless. Men are taller than women. That's the difference between applied and pure math. The difference between math and science is something else: even useless science has value, while most useless math (which consists of pure math) doesn't. (I would say the axiomatization of new theories, like probability theory, can also have inherent value, independent of any usefulness, insofar as it is conceptual progress, but that's different from proving pure math conjectures.)
My favorite example is number theory. Before cryptography came along it was pure math, an esoteric branch just for number nerds. Turns out, super applicable later on.
Among others.
Of course you never know which math concept will turn out to be physically useful, but clearly enough do that it's worth buying conceptual lottery tickets with the rest.
Ironically this example turns out to be a great object lesson in not underestimating the utility of research based on an eyeball test. But it shouldn't even have to have any intuitively plausible payoff whatsoever in order to justify it. The whole point is that even if a given research paradigm completely failed the eyeball test, our attitude should still be that it very well could have practical utility, and there are so many historical examples to this effect (the other commenter already gave several examples, and the right thing to do would have been to acknowledge them), and besides I would argue they still have the same intrinsic value that any and all knowledge has.
I doubt that this is true.
If we knew that it was all going to be useless, however, then it’s a hobby for someone, not something we should be paying people to do. Sure, if you enjoy doing something useless, knock yourself out… but on your own dime.
Don't be so ignorant. A few years ago NO ONE could have come up with something as generic as an LLM, which helps you solve this kind of problem and also create text adventures and Java code.
None of it was even imaginable, and yes, the progress is crazy fast.
How can you be so dismissive?
"Intrinsic" in contexts like this is a word for people who are projecting what they consider important onto the world. You can't define it in any meaningful way that's not entirely subjective.
The only thing that saves science from being nothing more than “huh, will you look at that,” is when it can make use of a mathematical model to provide insight into relationships between phenomena.
Pretty soon, this is going to mean the entire historical math literature will be formalized (or, in some cases, found to be in error). Consider the implications of that for training theorem provers.
What's more, there's almost surely going to turn out to be a large amount of human-generated mathematics that's "basically" correct, in the sense that there exists a formal proof that morally fits the arc of the human proof, but where informal/vague reasoning is used (e.g. diagram arguments, etc.) that is hard to really formalize, yet an expert can apply consistently without making a mistake. This will take a long time to formalize, and I expect it will require a large amount of human and AI effort.
This particular field seems ideal for AI, since verification enables identification of failure at all levels. If the definitions are wrong the theorems won't work and applications elsewhere won't work.
But as far as we know, the proof it wrote is original. Tao himself noted that it’s very different from the other proof (which was only found now).
That’s so far removed from a “search engine” that the term is essentially nonsense in this context.
AI is currently great at interpolation, and in some fields (like biology) there seems to be low-hanging fruit for this kind of connect-the-dots exercise. A human would still be considered smart for connecting these dots IMO.
AI clearly struggles with extrapolation, at least if the new datum is fully outside the training set.
And we will have AGI (if not ASI) if/when AI systems can reliably form new paradigms. It’s a high bar.
But, I don't know. I tend to view these (reasoning) LLMs as alien minds and my intuition of what is perhaps happening under the hood is not good.
I just know that people have been using these LLMs as search engines (including Stephen Wolfram), browsing through what these LLMs perhaps know and have connected together.
https://mehmetmars7.github.io/Erdosproblems-llm-hunter/probl...
https://chatgpt.com/share/696ac45b-70d8-8003-9ca4-320151e081...
I would love to know which concepts are active in the deeper layers of the model while generating the solution.
Is there a concept of “epsilon” or “delta”?
What are their projections on each other?
One wonders if some professional mathematicians are instead choosing to publish LLM proofs without attribution for career purposes.
"This LLM is kinda dumb in the thing I'm an expert in"
Anecdotally, I, as a math postdoc, think that GPT 5.2 is much stronger qualitatively than anything else I've used. Its rate of hallucinations is low enough that I don't feel like the default assumption of any solution is that it is trying to hide a mistake somewhere. Compared with Gemini 3 whose failure mode when it can't solve something is always to pretend it has a solution by "lying"/ omitting steps/making up theorems etc... GPT 5.2 usually fails gracefully and when it makes a mistake it more often than not can admit it when pointed out.
I assume OP was mostly joking, but we need to take care about letting AI companies hype up their impressive progress at the expense of mathematics. This needs to be discussed responsibly.
This will just become the norm as these models improve, if it isn't largely already the case.
It's like sports where everyone is trying to use steroids, because the only way to keep up is to use steroids. Except there aren't any AI-detectors and it's not breaking any rules (except perhaps some kind of self moral code) to use AI.
> the best way to find a previous proof of a seemingly open problem on the internet is not to ask for it; it's to post a new proof
Case in point: I just wanted to give z.ai a try and buy some credits. I used Firefox with uBlock and the payment didn't go through. I tried again with Chrome and no adblock, but now there is an error: "Payment Failed: p.confirmCardPayment is not a function." The irony is that this is certainly vibe-coded with z.ai, which tries to sell me on how good they are but then isn't able to close the sale.
And we will get lots more of this in the future. LLMs are a fantastic new technology, but even more fantastically over-hyped.
Models just generate text. Apps are supposed to make that text useful.
An app can run various kinds of verification. But would you pay extra for that?
Nobody can make a text generator to output text which is 100% correct. That's just not a thing people can do now.
I’m not sure what this proves. I dumped a question into ChatGPT 5.2 and it produced a correct response after almost an hour [2]?
Okay? Is it repeatable? Why did it come up with this solution? How did it come up with the connections in its reasoning? I get that it looks correct and Tao’s approval definitely lends credibility that it is a valid solution, but what exactly is it that we’ve established here? That the corpus that ChatGPT 5.2 was trained on is better tuned for pure math?
I’m just confused what one is supposed to take away from this.
[1] https://news.ycombinator.com/item?id=46560445
[2] https://chatgpt.com/share/696ac45b-70d8-8003-9ca4-320151e081...
Erdos was prolific and many of his open problems are numbered and have space to discuss them online, so it’s become fairly common to run through them with frontier models and see if a good proof can be come up with; there have been some notable successes here this year.
Tao seems to engage in sort of a two step approach with these proofs - first, are they correct? Lean formalization makes that unambiguous, but not all proofs are easily formulated into Lean, so he also just, you know, checks them. Second, literature search inside LLMs and out for prior results — this is to check where frontier models are at in the ‘novel proofs or just regurgitated proofs’ space.
To my knowledge, we’re currently at the point where we are seeing some novel proofs offered, but I don’t think we’ve seen any that have absolutely no priors in literature.
As you might guess this is itself sort of a Rorschach test for what AI could and will be.
In this case, it looked at first like this was a totally novel solution to something that hadn’t been solved before. On deeper search, Tao noted it’s almost trivial to prove with stuff Erdos knew, and also had been proved independently; this proof doesn’t use the prior proof mechanism though.
Personally, I've been applying them to hard OCR problems. Many varied languages concurrently, wildly varying page structure, and poor scan quality; my dataset has all of these things. The models take 30 minutes a page, but the accuracy is basically 100% (it'll still struggle with perfectly-placed bits of mold). The next best model (Google's flagship) rests closer to 80%.
I'll be VERY intrigued to see what the next 2, 5, 10 years does to the price of this level of model.
But thanks for the downvote in addition to your useless comment.
They never bothered to check Erdos's solution, already published 90 years ago. I am still confused about why Erdos, who proposed the problem and the solution, would consider this an unsolved problem, but this group of researchers would claim "ohh my god look at this breakthrough".
The LLM did better on this problem than 100% of the haters in this thread could do, haters who probably can't even begin to "understand" the problem.