Evaluating Language-Model Agents on Realistic Autonomous Tasks

Update on ARC's recent eval efforts