Model Evaluation

Benchmarks, test harnesses, and responsible rollout practices.