Howardism | Interaction & Multimodal notes

Map of Content for the interaction-multimodal domain — 8 concepts. Curated entry point; see Home for all domains.

Encoder-Free Early Fusion — Multimodal design with minimal pre-processing instead of large standalone encoders: TML co-trains dMel audio + 40×40-patch hMLP + flow head in one transformer for 200ms latency; Gemma 4's 12B independently discards a 305M audio conformer for on-device memory — two labs, two motivations, same verdict, with a dense-text vision regression as the one measured cost
Full-Duplex Interaction — Perceiving and responding simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speech, live translation/commentary, and time-aware speech are all special cases of model behavior
Interaction / Background Model Split — Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tools; rich-context-package delegation; "reasoning-model planning at non-thinking latency"
Interaction Models — Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via harness; interactivity scales with intelligence only if it's in the model
Interactivity Benchmarks — FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (visual proactivity); TML-Interaction-Small: 0.40s turn-taking latency, dominates interaction quality
Time-Aligned Micro-Turns — The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; streaming-sessions inference (upstreamed to SGLang), latency-tuned MoE kernels, bitwise trainer-sampler alignment
Turn-Based Interface Bottleneck — Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out by the interface, not the work; less-intelligent harness (VAD/turn-detection) should dissolve
Why AI Lags at Design — Andrew Ambrosino's four reasons frontier models are worse at visual/product design than at code: design is hard to grade (no clean reward like 'does it compile'), it sat outside the AI-research flywheel labs optimized for, it rewards novelty where code rewards known patterns, and it hides a design↔code abstraction layer (a rebrand is 263 components on the surface, semantic relationships underneath)


Open questions: 9 open