progscrape: surgehq.ai

LMArena is a cancer on AI

54 days ago surgehq.ai ai cancer

SWE-Bench Failures: When Coding Agents Spiral into 693 Lines of Hallucinations

5 months ago surgehq.ai

HellaSwag: 36% of this popular large language model benchmark contains errors

3 years ago surgehq.ai

Evaluation of TikTok vs. Instagram Reels

3 years ago surgehq.ai instagram tiktok

30% of Google's Emotions Dataset Is Mislabeled

3 years ago surgehq.ai google

Generating Children’s Stories Using GPT-3 and DALL·E

3 years ago surgehq.ai

I wanted burritos. Facebook Search sent me to a dead restaurant 45m away

3 years ago surgehq.ai facebook

We asked 100 humans to draw the DALL·E prompts

3 years ago surgehq.ai

Google Search Is Falling Behind, Especially in Code, Cooking and Travel

3 years ago surgehq.ai google

Building a no-code toxicity classifier by talking to GitHub Copilot

3 years ago surgehq.ai github

Are popular toxicity models simply profanity detectors?

4 years ago surgehq.ai

Is Google Search Deteriorating? Measuring Google's Search Quality in 2022

4 years ago surgehq.ai google

Examples of the Importance of Context-Sensitivity in Data-Centric AI

4 years ago surgehq.ai ai