/s
If you’re trying to learn CUDA, then using a higher-level language is not the best approach. If you have already used a high-level language and found that performance is lacking, and that it would improve if you could fuse some of your kernels and avoid some of the memory round-trips, then moving to something lower level is called for.
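To make the "fuse kernels, avoid memory round-trips" point concrete, here is a minimal CPU-side sketch in plain Python. It is not GPU code: `CountingBuffer`, `unfused` and `fused` are hypothetical names made up for illustration, each loop stands in for one kernel launch, and the buffer just counts element loads and stores the way you might count global-memory traffic.

```python
class CountingBuffer:
    """List-backed stand-in for a global-memory array that counts accesses."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def load(self, i):
        self.reads += 1
        return self.data[i]

    def store(self, i, value):
        self.writes += 1
        self.data[i] = value


def unfused(x):
    """Two separate 'kernels': scale, then add a bias, via a temporary buffer."""
    n = len(x.data)
    tmp = CountingBuffer([0.0] * n)
    out = CountingBuffer([0.0] * n)
    for i in range(n):  # "kernel" 1: tmp = 2 * x
        tmp.store(i, 2.0 * x.load(i))
    for i in range(n):  # "kernel" 2: out = tmp + 1
        out.store(i, tmp.load(i) + 1.0)
    traffic = x.reads + tmp.writes + tmp.reads + out.writes
    return out.data, traffic


def fused(x):
    """One 'kernel': the intermediate value never leaves a local variable."""
    n = len(x.data)
    out = CountingBuffer([0.0] * n)
    for i in range(n):
        scaled = 2.0 * x.load(i)  # plays the role of a register
        out.store(i, scaled + 1.0)
    traffic = x.reads + out.writes
    return out.data, traffic


if __name__ == "__main__":
    values = [float(i) for i in range(8)]
    result_a, traffic_a = unfused(CountingBuffer(values))
    result_b, traffic_b = fused(CountingBuffer(values))
    print(result_a == result_b, traffic_a, traffic_b)
```

For N elements the unfused version touches memory 4N times and the fused one 2N times, with identical results. For a memory-bound elementwise operation that ratio is roughly the kind of gain fusion is after, though real numbers depend on the hardware and the kernels involved.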
I’m suggesting it’s better to think about your goals for one minute and understand the basic choices than it is to assume there’s something that works for everyone’s goals, and higher level languages don’t meet everyone’s goals.
Each one has their place.
I started learning about GPUs and CUDA from this book recently, and I agree the writing is confusing and the code examples have errors. However, it is still a nice reference on many types of algorithms for heterogeneous memory devices; it also helped me better understand some patterns for CPUs.
Which of these (warp, numba, cp) is the best bet for a beginner?
Programming Massively Parallel Processors: A Hands-on Approach is not really good in my opinion: many small mistakes and confusing sentences (even when you know CUDA).
CUDA by Example: An Introduction to General-Purpose GPU Programming is too simple and abstracts away too much of the architecture.
Next year I'm planning to start writing a CUDA book that starts by engineering the hardware and goes up to the optimization part on that hardware (which is basically an NVIDIA card), including all the main algorithms (except for graphs).
I'm already teaching the course in this way at uni, and it is quite successful among students.
I would separate the knowledge into maybe 3 distinct buckets.
The baseline: device/host boundary, SIMT programming etc.
The intermediate: kernel architecture, CUDA graph vs persistent kernels, warp specialisation/divergence avoidance techniques etc.
The advanced: architecture specifics such as tcgen05, TMA, SMEM/HBM, memory throughput vs compute biases in various arch implementations, GEMM, FMHA, and all the tricks that make modern fused kernels very fast. I would also bucket most GPUDirect RDMA/GPU NetIO/friends here too.
The baseline hasn't changed much and probably won't. The intermediate knowledge has also remained reliably stable for ~10 years, with only things like graphs changing it. Tile might become more relevant than it is today, but for now CUDA, cuBLAS, and friends are where it's worth investing knowledge.
This would require very different (re-written?) kernels than a few years back, wouldn't it?
Would you have any good resources on the topic?
The major difference is that this doesn't have a natural moment to transform/touch the values after they are read from global memory and before they are stored to shared memory.
Otherwise, tiled MMA (GEMM) kernels were normal even in Maxwell days (after the classic K80, before the P100; Maxwell is when H.265 support landed).
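For anyone who hasn't seen the tiled GEMM structure being discussed, here is a rough CPU-side sketch of the loop nest in plain Python. It is only a model: `tiled_matmul`, `naive_matmul` and the `tile` parameter are hypothetical names, the `a_tile`/`b_tile` lists stand in for shared-memory staging buffers, and nothing here reflects how any particular library implements it.

```python
def naive_matmul(a, b):
    """Reference result: plain triple loop over row-major lists of lists."""
    inner = len(b)
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def tiled_matmul(a, b, tile=2):
    """Compute C = A @ B one output tile at a time, marching along K in tiles."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):  # one "thread block" per output tile
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):  # step through the K dimension
                # Staging step: copy one tile of A and one of B into local
                # buffers (the role shared memory plays on a GPU). In a
                # classic kernel the values pass through registers here.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # Multiply-accumulate the two staged tiles into C.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for kk, a_val in enumerate(a_row):
                            acc += a_val * b_tile[kk][j]
                        c[i0 + i][j0 + j] += acc
    return c


if __name__ == "__main__":
    a = [[float(i + j) for j in range(5)] for i in range(3)]
    b = [[float(i * j) for j in range(4)] for i in range(5)]
    print(tiled_matmul(a, b, tile=2) == naive_matmul(a, b))
```

The explicit staging step is where a classic kernel can touch the values between the global read and the shared store; the slicing also handles edge tiles when the dimensions aren't multiples of `tile`.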
For higher-level patterns, again, the APIs are the same, so anything building on libibverbs, the aforementioned UCX, etc. is pretty compatible at the level of high-level ideas. If you are new to RDMA in general and really want to build good intuition, definitely start with raw verbs instead of abstractions like MPI, then move to MPI once you understand what it is doing for you.
So tl;dr, you have at least one person who would pay for a better book :-)
I always appreciate book lists like this one, but having a small targeted list is more practical for those of us with limited reading time.
What makes CUDA Programming: A Developer's Guide to Parallel Computing with GPUs better among its peers?
https://docs.nvidia.com/cuda/cuda-programming-guide/pdf/cuda...
In this day and age, when programming is so accessible, why not have a more tempting pitch than just book titles categorized by difficulty?
With CUDA, you can make Nvidia GPUs go brrrr.
Oh. And thereby, incidentally conquer the compute world.
Understand everything he talks about and you understand CUDA.
An expensive NVIDIA GPU is required only if your purpose is not just to learn, but to actually do useful graphics or ML/AI work.