Evaluating the Evaluation: A Benchmarking Checklist

Related Stories

Benchmarking is hard, sometimes

Why hasn’t partial evaluation been applied to Pandas?

Benchmarking Strategies for Non-Standard Cognitive Architectures

Wrote about benchmarking and profiling in golang

To Mock Or Not To Mock Your Auth: The Checklist

LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Canadian universities grapple with evaluating students amid AI cheating fears

NASA ‘evaluating’ opportunities to launch rockets to Mars during Trump presidency

3 Years of Remote Work

Benchmarking Zasper versus JupyterLab

Benchmarking Crimes Meet Formal Verification

Does Rust define evaluation order?

Recursive Data Structures and Lazy Evaluation

OpenBSD IO Benchmarking: How Many Jobs Are Worth It?

Small but powerful dummy Object generator for Testing & Benchmarking!

Checklist for software engineers who think there's no growth without working at scale

HealthBench – An evaluation for AI systems and human health

`overflow evaluating the requirement` when using simple trait bounds for error handling

Can You Trust Code Copilots? Evaluating LLMs from a Code Security Perspective

CMU TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Show HN: Pi Co-pilot – Evaluation of AI apps made easy

9 Lazy Evaluation Features in Python That Optimize Your Code Quietly

Design and evaluation of a parrot-to-parrot video-calling system (2023)

Querying 10M rows in 11 seconds: Benchmarking ConnectorX, Asyncpg and Psycopg vs QuestDB

Re-evaluating Fan-Out-on-Write vs. Fan-Out-on-Read Under Celebrity Traffic Spikes (2025)

The serum evaluation of sex hormones including DHEAs, DHT, testosterone in oral lichen planus patients

Doom GPU Flame Graphs

Benchmarking via github actions

Evaluating Agent-Based Program Repair at Google