Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Related Stories

GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

ReasoningGym: Reasoning Environments for RL with Verifiable Rewards

Beyond the Black Box: Interpretability of LLMs in Finance

TradeExpert, a trading framework that employs Mixture of Expert LLMs

From tokens to thoughts: How LLMs and humans trade compression for meaning

What do software developers need to know to succeed in an age of AI?

Beyond Attention: Toward Machines with Intrinsic Higher Mental States

How to Grow an LSM-tree? Towards Bridging the Gap Between Theory and Practice

LLMs replacing human participants harmfully misportray, flatten identity groups

AI Persona Groupthink Makes Group Talk More Realistic

3D CAD from Images, Text, and Point Clouds with RLVR

TLOB: Dual Attention Transformer Predicts Price Trends from Order Book Data

How much do language models memorize?

Oh fuck! How do people feel about robots that leverage profanity?

Not all tokens are meant to be forgotten

Extreme Super-Resolution via Scale Autoregression and Preference Alignment

LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

Prompting for AI Agents [video]

Show HN: Container Use for Agents

Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology - Nature Cancer

Open source CLI tool for CodeAct agents

Top 7 Python Frameworks for AI Agents

Research found individuals who reduced their weight from overweight to a healthy range during midlife, without medications or surgery, experienced meaningful long-term health benefits

Why I'm excited about Go for agents

I’m Open-Sourcing my Custom Benchmark GUI

js-engine-benchmark: Rust Boa vs. Zig Kiesel

Tired of Scrolling Through Long AI Chat Histories? Meet Prompt Navigator!

MargaritaImageGen – Terminal-Based Bing Image Generator (Perfect for AI Agents)

Autonomous drone defeats human champions in racing first