How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores (2024)