Fast Multidimensional Matrix Multiplication on CPU from Scratch (2022)

How to optimize a CUDA matmul kernel for cuBLAS-like performance (2022)

How to Optimize a CUDA Matmul Kernel for cuBLAS-Like Performance: A Worklog