Howardism · vol. 03 · quiet corner of the web

MOC — Interaction & Multimodal

Published: May 25, 2026 · Filed: Concept · Topic: Architecture · Reading: 2 min · Source: AI-synthesised

<!-- BEGIN GENERATED: moc -->

Interaction & Multimodal

Map of Content for the interaction-multimodal domain — 7 concepts. Curated entry point; see Home for all domains.

  • Encoder-Free Early Fusion — Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch hMLP for frames, flow head for audio out, all co-trained from scratch in one transformer
  • Full-Duplex Interaction — Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speech, live translation/commentary, time-aware speech — all special cases of model behavior
  • Interaction / Background Model Split — Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tools; rich-context-package delegation; "reasoning-model planning at non-thinking latency"
  • Interaction Models — Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via harness; interactivity scales with intelligence only if it's in the model
  • Interactivity Benchmarks — FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (visual proactivity); TML-Interaction-Small: 0.40s turn-taking latency, dominates interaction quality
  • Time-Aligned Micro-Turns — The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; streaming-sessions inference (upstreamed to SGLang), latency-tuned MoE kernels, bitwise trainer-sampler alignment
  • Turn-Based Interface Bottleneck — Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out by the interface, not the work; less-intelligent harness (VAD/turn-detection) should dissolve
<!-- END GENERATED: moc -->
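To make the Time-Aligned Micro-Turns entry more concrete, here is a minimal sketch of what "continuous streams in ~200ms interleaved chunks, no turn boundaries" could look like as a data structure. It is an illustration only, not Thinking Machines Lab's implementation: `MicroTurn`, `chunk_stream`, `interleave`, and the sample events are hypothetical names and data invented for this example, and the 200 ms window is taken from the approximate figure in the summary above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

CHUNK_MS = 200  # assumed window size, from the "~200ms" figure in the list above


@dataclass(frozen=True)
class MicroTurn:
    """One fixed-width time window of a single stream (hypothetical type)."""
    start_ms: int
    source: str   # "in" for the user stream, "out" for the model stream
    payload: str


def chunk_stream(events: List[Tuple[int, str]], source: str,
                 chunk_ms: int = CHUNK_MS) -> List[MicroTurn]:
    """Group (timestamp_ms, token) events into fixed-width time windows."""
    buckets: Dict[int, List[str]] = {}
    for t_ms, token in events:
        start = (t_ms // chunk_ms) * chunk_ms  # align to the window start
        buckets.setdefault(start, []).append(token)
    return [MicroTurn(start, source, " ".join(tokens))
            for start, tokens in sorted(buckets.items())]


def interleave(inp: List[MicroTurn], out: List[MicroTurn]) -> List[MicroTurn]:
    """Merge both streams by window start; input precedes output in a window."""
    order = {"in": 0, "out": 1}
    return sorted(inp + out, key=lambda m: (m.start_ms, order[m.source]))


# Made-up sample data: the model backchannels while the user is still speaking.
user_events = [(0, "can"), (150, "you"), (210, "hear"), (450, "me")]
model_events = [(220, "mm-hm"), (610, "yes")]

timeline = interleave(chunk_stream(user_events, "in"),
                      chunk_stream(model_events, "out"))

for turn in timeline:
    print(f"{turn.start_ms:>4} ms  {turn.source:<3}  {turn.payload}")
```

In this sample the user's "hear" and the model's "mm-hm" both fall in the window starting at 200 ms, so input and output coexist in one time slice rather than waiting for a turn boundary. A real system would stream audio and video frames rather than text tokens; the sketch only shows the time alignment.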
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Related articles
  • TML-Interaction-Small

    TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…
