PA bench: Evaluating web agents on real-world personal assistant workflows