Howardism · vol. 03 · quiet corner of the web

LLM Evaluation, tagged.

Notes: 3 · Tag: LLM Evaluation · Oldest: 14 Apr 2026 · Newest: 23 May 2026

Every article tagged LLM Evaluation, newest first.

| Title | Summary | Date |
| --- | --- | --- |
| The Verifiability Thesis | LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peaks; "verifiable + labs care"; everything eventually verifiable | |
| Interactivity Benchmarks | FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (visual proactivity); TML-Interaction-Small: 0.40s turn-taking latency, dominates interaction quality | |
| Scale-Dependent Prompt Sensitivity | Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26pp and fully reverse hierarchy on GSM8K/MMLU-STEM | |