I rebuilt FlashAttention in Triton to understand the performance archaeology