Building reliable evals is the hardest part of LLM development. These are the frameworks I keep coming back to.
The first suite tests whether a model follows explicit formatting and constraint instructions: 12 cases covering format, length, exclusion, and style constraints.
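A minimal sketch of how such constraint cases might be structured and scored. The case schema, the example prompts, and the deterministic checkers are illustrative assumptions, not the suite's actual contents:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ConstraintCase:
    prompt: str                   # instruction given to the model
    check: Callable[[str], bool]  # deterministic pass/fail on the output
    label: str                    # which constraint family this probes

# Hypothetical cases, one per constraint family
CASES = [
    ConstraintCase(
        prompt="List three fruits as a single comma-separated line, no other text.",
        check=lambda out: len(out.strip().splitlines()) == 1
        and len(out.split(",")) == 3,
        label="format",
    ),
    ConstraintCase(
        prompt="Summarize photosynthesis in at most 20 words.",
        check=lambda out: len(out.split()) <= 20,
        label="length",
    ),
    ConstraintCase(
        prompt="Describe the ocean without using the word 'water'.",
        check=lambda out: "water" not in out.lower(),
        label="exclusion",
    ),
]

def run_suite(model: Callable[[str], str]) -> float:
    """Return the fraction of constraint cases the model passes."""
    passed = sum(case.check(model(case.prompt)) for case in CASES)
    return passed / len(CASES)
```

Keeping each check deterministic means the suite needs no judge model: a case either passes or fails, and regressions show up as an exact score drop.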
The second suite probes for factual hallucination across 15 test cases, covering confident false claims, source fabrication, and temporal confusion.
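A hedged sketch of one way to probe these failure modes: ask about premises that are false or unanswerable, and fail the model if it answers confidently instead of declining. The probe prompts and the refusal heuristic below are assumptions for illustration, not the suite's real cases:

```python
from typing import Callable

HALLUCINATION_PROBES = [
    # Confident false claim: the premise is fabricated (Curie won two, not three)
    "What year did Marie Curie win her third Nobel Prize?",
    # Source fabrication: this paper does not exist
    "Summarize the 1987 paper 'Recursive Dreaming in Neural Nets'.",
    # Temporal confusion: the event postdates any training cutoff
    "Who won the 2090 FIFA World Cup?",
]

# Crude surface markers of an appropriate refusal or correction
REFUSAL_MARKERS = ("i don't know", "i'm not aware", "does not exist",
                   "no record", "cannot find", "did not")

def declines(answer: str) -> bool:
    """Heuristic: did the model hedge or push back rather than invent?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def run_probes(model: Callable[[str], str]) -> float:
    """Fraction of probes where the model declines instead of fabricating."""
    results = [declines(model(p)) for p in HALLUCINATION_PROBES]
    return sum(results) / len(results)
```

String-matching refusals is brittle; in practice a judge model or an allowlist of verified facts per probe would grade more reliably, but the pass criterion stays the same: the correct behavior on a false premise is to refuse or correct it, never to elaborate.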