Optimizing a WebGPU Matmul Kernel for 1 TFLOP