Howardism · Vol. 03 · Plate II · No. 02
LLM Evaluation, tagged.
Notes: 3 · Tag: LLM Evaluation · Oldest: 14 Apr 2026 · Newest: 23 May 2026
Every article tagged llm evaluation, newest first.
| Title | Summary | Date |
|---|---|---|
| The Verifiability Thesis | LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peaks; "verifiable + labs care"; everything eventually verifiable | |
| Interactivity Benchmarks | FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (visual proactivity); TML-Interaction-Small: 0.40s turn-taking latency, dominates interaction quality | |
| Scale-Dependent Prompt Sensitivity | Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26pp and fully reverse hierarchy on GSM8K/MMLU-STEM | |