Show HN: Autonomous recovery for distributed training jobs