The README only gives numbers without any information on what you’re supposed to do or how you are rated.
Being cryptic and poorly specified is part of the assignment, just like real code.
In fact, it's _still_ better documented and self-contained than most of the problems you'd usually encounter in the wild. Pulling on a thread until you end up with a clear picture of what needs to be accomplished is, very often, like 90% of the job.
IMO the assignment('s purpose) could be improved by making the code significantly worse. Then you're testing the important stuff (dealing with ambiguity) that the AI can't do so well. Probably the reason they didn't do that is that it would make evaluation harder and more costly.
Basically it's a long enough problem that I'd be annoyed at being asked to do it at home for free, if what I wanted from that was a shot at an interview. If I had time on my hands though, it's something I could see trying for fun.
With a live interview, you get past a phone screening, and now the company is investing significant resources in the day or so of engineering time it takes to have people interview you. They won't do that unless they have a serious level of interest in you. The take-home means no investment for the company so there's a huge imbalance.
There's another thread about this article, which describes an analogous situation, being asked to read AI slop: https://zanlib.dev/blog/reliable-signals-of-honest-intent/
I suspect it would take me another hour to get it implemented. Leaving 30 minutes to figure out something clever?
Idk maybe I'm slow or really not qualified.
So yeah. They _could_ have written it much more clearly in the readme.
And before some smart aleck says you can be creative on these types of optimization problems: not in two hours; it's far too risky vs. regurgitating some standard set of tried-and-true algos.
Good. That should be the minimum requirement.
Not another Next.js web app take home project.
> I was reminding them that there is always someone smarter.
And even with this comment you literally do not understand that you have some skewed view of the world. Do you have some high school trauma?
I am not sure ad personam is appropriate here
> And even with this comment you literally do not understand that you have some skewed view of the world.
I’m well aware I don’t have a perfect view of reality and the map isn’t the territory. Do you?
It's a take-home test, which means some people will spend more than a couple of hours on it to polish their answer until it's really good. They would have gone after those people in particular.
You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.
I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.
It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...
But to be honest, I wonder what algorithm they implement. I read the code for 2 minutes, and it sounds like random forest prediction. Does anyone know what the code does?
The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...
But presumably similar principles apply.
This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.
Did you apply for a position? Did they send you the assignment without prior discussion?
Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.
Does help that I have a very public web presence and portfolio, though.
I couldn't care less about getting paid for a few hours, what's truly annoying when you're job hunting is the company having an extremely high rejection rate even at the take-home stage. That's an inordinate waste of time multiplied by a lot of companies.
If you have a >50% chance of rejecting, don't even give the candidate a take-home. Be at least 90% sure you want them before you get to that stage.
As a take-home assignment, though, I would have failed, as I'd probably have spent the 2 hours just sketching out ideas and more on my tablet while reading the code, before even changing it.
"before Claude Opus 4.5 started doing better than humans given only 2 hours"
"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"
"Claude Opus 4.5 after 2 hours in our test-time compute harness"
"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"
So that does make one wonder where this comes from. Could just be LLM-generated with a talking point of "2 hours"; models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell.
Would be quite curious to know though. How I usually design take home assignments is:
1. Candidate has several _days_ to complete (usually around a week).
2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.
But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”
"do better than we have publicly admitted most of humanity can do, and we may deign to interview you"
It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.
Do you think that, if the applicants were really in that level of demand, they would be getting a take-home test instead of being actively recruited?
Legitimately lay out your understanding of a world where an employer chasing after employees who are in high demand gives them a test that is expected to take hours, and hedges its wording instead of saying "we will absolutely hire you if you pass X bar".
There's more to employees than their raw ability to go below some performance threshold. If somebody passes the test, but lives in a US-sanctioned country with no plans to move, is well known for using the n-word on social media, or has previously broken an NDA, Anthropic probably doesn't want to interview them.
>If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
Not condescending
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code so we can schedule an interview.
> do better than we have publicly admitted most of humanity can do, and we may deign to interview you
If you tell someone that, after passing a test 99.999% of humanity cannot pass, they _may_ get an interview, you are being snarky/condescending.
If three people send them improvements, they'll probably get interviews. If three thousand do, the problem is easier than they thought or amenable to an LLM or one bright person figured out a trick and shared it with all his classmates or colleagues or all of GitHub.
That would be impressive.
After ~40 minutes, it got to:
The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.
It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.
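For anyone else who hasn't met these: loop unrolling is the most approachable of them. Here's a minimal sketch of the idea in plain Python (hypothetical, and nothing like the take-home's actual instruction format): process several elements per loop iteration, which cuts loop overhead and exposes independent operations that parallel hardware can overlap.

    # Hypothetical illustration of loop unrolling, not the take-home's code.
    def sum_rolled(xs):
        total = 0
        for x in xs:  # one element per iteration: branch + index overhead each time
            total += x
        return total

    def sum_unrolled(xs):
        total = 0
        i, n = 0, len(xs)
        while i + 4 <= n:
            # Four independent adds per trip: fewer branches, and a VLIW
            # scheduler (or out-of-order hardware) can overlap them.
            total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
            i += 4
        while i < n:  # leftover tail
            total += xs[i]
            i += 1
        return total

    assert sum_unrolled(list(range(10))) == sum_rolled(list(range(10))) == 45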
I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.
you can't compete with an AI on doing an AI performance benchmark?
I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
Does this confirm they actually do kneecap models after the launch period to save money, without telling users?
If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.
Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?
Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?
Something comes across really badly here for me. Some weird mix of bragging and mocking, with a hint of aloofness.
I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.
Each ran the same spec headlessly in their native harness (one shot).
Results:
Agent                         Cycles   Time
─────────────────────────────────────────────
gpt-5-2                        2,124   16m
claude-opus-4-5-20251101       4,973   1h 2m
gpt-5-1-codex-max-xhigh        5,402   34m
gpt-5-codex                    5,486   7m
gpt-5-1-codex                 12,453   8m
gpt-5-2-codex                 12,905   6m
gpt-5-1-codex-mini            17,480   7m
claude-sonnet-4-5-20250929    21,054   10m
claude-haiku-4-5-20251001    147,734   9m
gemini-3-pro-preview         147,734   3m
gpt-5-2-codex-xhigh          147,734   25m
gpt-5-2-xhigh                147,734   34m
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
The hot take is: there are other games.
In every other field it's helpful to understand the basics. I don't think software is the exception here.
Yes, this applies to some simulated imaginary CPU with an artificial problem. Except that the task being tested here is exactly the core of what a performance engineer will do at Anthropic: optimizing kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)
This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.
Isn't that mostly because, as you go up the abstraction layers, tools and docs that teach you the tricks of the trade fast are in abundance (let alone for a popular layer like React)? Which in turn is likely a function of incentives and opportunities.
- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator
If you look at the top of perf_takehome.py there is a brief comment saying the challenge is to optimize a kernel. "Kernel" in GPU land means a program that computes on data in parallel; it's not an OS kernel:
Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
available time, as measured by test_kernel_cycles on a frozen separate copy
of the simulator.
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line: https://github.com/anthropics/original_performance_takehome/...
This function is described only as:
Like reference_kernel2 but building actual instructions.
Scalar implementation using only scalar ALU and load/store.
The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be, because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the Python function reference_kernel2, which is found in problem.py.
What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.
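To make that shape concrete, here's a toy version of the pattern (the names and instruction format are invented for illustration, not the repo's actual KernelBuilder API): a builder appends instruction tuples to a list, and an interpreter later walks the list and reports a cycle count.

    # Hypothetical sketch only; invented names, not the repo's real API.
    class ToyKernelBuilder:
        def __init__(self):
            self.instrs = []  # the "program" is just a list of tuples

        def load(self, dst, addr):
            self.instrs.append(("load", dst, addr))

        def add(self, dst, a, b):
            self.instrs.append(("add", dst, a, b))

        def store(self, addr, src):
            self.instrs.append(("store", addr, src))

    def run(instrs, mem, nregs=8):
        regs = [0] * nregs
        for op, *args in instrs:  # one instruction per "cycle"
            if op == "load":
                regs[args[0]] = mem[args[1]]
            elif op == "add":
                regs[args[0]] = regs[args[1]] + regs[args[2]]
            elif op == "store":
                mem[args[0]] = regs[args[1]]
        return len(instrs)  # the cycle count you're trying to minimize

    kb = ToyKernelBuilder()
    kb.load(0, 0); kb.load(1, 1); kb.add(2, 0, 1); kb.store(2, 2)
    mem = [3, 4, 0]
    assert run(kb.instrs, mem) == 4 and mem[2] == 7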
At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago with Itanium and it never took off; since then the concept has been largely dead, though I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA, so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or, in this case, the humble programmer: you. But this isn't just a VLIW machine, it's also multi-core and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple pieces of data simultaneously.
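The key semantic difference is easy to show in toy Python (again hypothetical, not the simulator's real format): a VLIW program is a list of bundles, every operation in a bundle reads register state as of the start of the cycle and they all write together, and the cycle count is simply the number of bundles. The whole game is therefore packing as many independent operations into each bundle as the hazard rules allow.

    # Toy VLIW semantics, hypothetical; not the simulator's actual ISA.
    def run_vliw(bundles, regs):
        for bundle in bundles:
            snapshot = list(regs)  # all reads see pre-cycle state
            for op, dst, a, b in bundle:
                if op == "add":
                    regs[dst] = snapshot[a] + snapshot[b]
                elif op == "mul":
                    regs[dst] = snapshot[a] * snapshot[b]
        return len(bundles)  # cycles == number of bundles

    regs = [1, 2, 3, 4, 0, 0]
    # Two independent ops packed into one bundle: one cycle instead of two.
    cycles = run_vliw([[("add", 4, 0, 1), ("mul", 5, 2, 3)]], regs)
    assert regs[4] == 3 and regs[5] == 12 and cycles == 1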
This machine doesn't have registers or cache but it does have "scratch space", and so you can use the vector instructions to load data into a series of 32 bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
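In the same toy style (still hypothetical, not the simulator's real instructions), broadcast plus one vector operation over scratch slots looks something like this:

    # Hypothetical sketch of broadcast + a SIMD op over "scratch" words.
    LANES = 8
    scratch = [0] * 64  # flat 32-bit scratch space

    def vbroadcast(base, scalar):
        # Repeat one scalar across LANES consecutive scratch words.
        for i in range(LANES):
            scratch[base + i] = scalar & 0xFFFFFFFF

    def vadd(dst, a, b):
        # One vector instruction: LANES independent adds in a single cycle.
        for i in range(LANES):
            scratch[dst + i] = (scratch[a + i] + scratch[b + i]) & 0xFFFFFFFF

    vbroadcast(0, 0xFF)  # slots 0..7 now all hold 0xFF
    vbroadcast(8, 1)     # slots 8..15 all hold 1
    vadd(16, 0, 8)       # slots 16..23 all hold 0x100
    assert scratch[16:24] == [0x100] * LANES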
And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.
So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.
Does that help? Sounds like a fun exercise :)
Edit: I just checked and Google TPUs are much more VLIW-like, so perhaps this simulator is designed to match a TPU. I know Anthropic relies on TPUs for serving and has done some optimization for them.
I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.
1) Python is unreadable.
2) AI companies are content with slop and do not even bother with clear problem statements.
3) LOC and appearance matter, not goals or correctness.
4) Anthropic must be a horrible place to work at.
Would you prefer C or C++?
"2) AI companies are content with slop and do not even bother with clear problem statements."
It's a filter. If you don't get the problem, you'll waste their time.
"3) LOC and appearance matter, not goals or correctness."
The task was goal+correctness.
"4) Anthropic must be a horrible place to work at."
Depends on what you do. For this position it's probably one of the best companies to work at.