LMArena is a cancer on AI

SWE-Bench Failures: When Coding Agents Spiral into 693 Lines of Hallucinations

HellaSwag: 36% of this popular large language model benchmark contains errors

Evaluation of TikTok vs. Instagram Reels

30% of Google's Emotions Dataset Is Mislabeled

Generating Children’s Stories Using GPT-3 and DALL·E

I wanted burritos. Facebook Search sent me to a dead restaurant 45m away

We asked 100 humans to draw the DALL·E prompts

Google Search Is Falling Behind, Especially in Code, Cooking and Travel

Building a no-code toxicity classifier by talking to GitHub Copilot

Are popular toxicity models simply profanity detectors?

Is Google Search Deteriorating? Measuring Google's Search Quality in 2022

Examples of the Importance of Context-Sensitivity in Data-Centric AI