3 must-read papers and articles on AI agent harnesses, context engineering, and long-horizon reliability, curated for builders.
Typical agents work fine at 200 lines of code, but push them to 2,000 lines and they lose track of what they were doing. Anthropic took inspiration from human engineers working in shifts and designed a two-part harness to fix exactly this.
You keep adding memory to an agent, thinking it will perform better. But this paper shows that by the 8th or 9th step, performance starts to drop sharply. Capability and reliability are two very different things, and most benchmarks measure only capability.
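One way to build intuition for the gap between capability and reliability is simple compounding. The sketch below is not the paper's method. It assumes every step succeeds independently with the same probability, which real agents rarely satisfy, and the function name is made up for illustration.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step of a task succeeds.

    Assumption (illustrative only): steps are independent and share
    the same success rate, so the probabilities simply multiply.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Show how a strong single-step rate erodes as the task gets longer.
    for steps in (1, 3, 6, 9, 12):
        probability = task_success_probability(0.95, steps)
        print(f"{steps:>2} steps: {probability:.3f}")
```

Under these assumptions, an agent that gets each step right 95% of the time finishes a nine-step task only about 63% of the time. A benchmark that scores single steps would report the first number and miss the second.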
Prompt engineering is no longer just about finding the right words. Context engineering means optimally curating all the tokens available to the LLM so it consistently produces the desired behavior across every step of a long task.
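As a rough illustration of what curating tokens can look like in practice, here is a minimal sketch that packs the most important context items into a fixed token budget. Everything in it is an assumption made for illustration: the `ContextItem` type, the integer priorities, the word-count token estimate, and the greedy selection rule. None of it comes from the article.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    text: str
    priority: int  # hypothetical score: higher means more important


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    # A real system would use the model's own tokenizer.
    return len(text.split())


def curate_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Greedily keep the highest-priority items that fit in the budget.

    Kept items are returned in their original order, so the model
    still sees them in the sequence they were produced.
    """
    chosen: list[tuple[int, ContextItem]] = []
    used = 0
    # sorted() is stable, so items with equal priority keep their order.
    ranked = sorted(enumerate(items), key=lambda pair: -pair[1].priority)
    for index, item in ranked:
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append((index, item))
            used += cost
    chosen.sort(key=lambda pair: pair[0])
    return [item for _, item in chosen]
```

Running this before every model call, rather than once at the start, is what keeps the context useful across a long task: stale tool output can be given a low priority and fall out of the window as newer, more relevant items arrive.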