Howardismvol. 03 · quiet corner of the web

Plate II機器翻譯 · machine-translatedENHOWARDISM

Encoder-Free Early Fusion

PublishedMay 13, 2026FiledConceptTagsLLM ArchitectureMultimodalReading3 minSourceAI-synthesised

以最少預處理取代大型獨立編碼器的多模態設計：dMel 音訊嵌入、40×40 patch hMLP 處理影像幀、flow head 產生音訊輸出，全部在單一 transformer 中從頭共同訓練

Encoder-Free Early Fusion 概念示意圖

資料來源#

Interaction Models: A Scalable Approach to Human-AI Collaboration

摘要#

Interaction Models 中的一項多模態設計選擇：不再將音訊和視訊路由到大型獨立編碼器（也不再透過獨立的 TTS 類解碼器輸出音訊），而是使用最少預處理搭配單一 transformer，所有元件從頭共同訓練。「Encoder-free」是相對的說法——仍有輕量嵌入層——但不存在 Whisper 規模的音訊編碼器或獨立的 TTS 模型。

各元件（單次 200ms micro-turn）#

輸入（文字 / 影像幀 / 音訊的任意子集）：

文字 → token 嵌入（標準做法）。
影像 / 視訊幀 → 切分為 40×40 patches，由 hMLP 編碼（Touvron et al. 2022）。
音訊 → 以 dMel 表示（Bai et al. 2024），經輕量嵌入層（「bag of embeddings」）轉換。

單一共享 Transformer 消化融合後的輸入。

輸出：

文字 → unembedding（標準做法）。
音訊 → 透過 flow head（Lipman et al. 2022）產生 mel。

所有元件與 transformer 從頭共同訓練——並非由預訓練編碼器/解碼器拼接而成。

為何重要#

避免大型獨立編碼器/解碼器帶來的延遲與複雜度——當你每 200ms 就要執行一次時尤為關鍵（見 Time-Aligned Micro-Turns）。
Early fusion（所有模態進入同一個 transformer）意味著模型能跨模態聯合推理，而非僅對編碼器預消化的輸出進行推理——這是實現「一邊說話一邊對視覺線索做出反應」等能力的前提（見 Full-Duplex Interaction）。
從頭共同訓練符合 The Bitter Lesson 的精神：減少手工設計的模組邊界，更多端到端學習。

相關連結#

Interaction Models — 上層概念
Time-Aligned Micro-Turns — 為何最少預處理是硬性需求（200ms 預算）
Interaction / Background Model Split — 架構的另一半
Full-Duplex Interaction — 跨模態聯合推理使視覺＋音訊插話成為可能
The Bitter Lesson — 「從頭共同訓練、捨棄模組化編碼器」正是 bitter-lesson 的實踐
TML-Interaction-Small — 實作此設計的模型（dMel 音訊、40×40 hMLP patches、flow head），從頭共同訓練

資料來源#

Interaction Models: A Scalable Approach to Human-AI Collaboration

§ end

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 7

Full-Duplex Interaction
Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Interaction & Multimodal
Map of Content for the interaction-multimodal domain — 7 concepts. Curated entry point; see Home for all domains.
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
TML-Interaction-Small
TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…

Related articles

Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Interactivity Benchmarks
FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…
Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
Full-Duplex Interaction
Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…

Related articles

Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Interactivity Benchmarks
FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…
Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
Full-Duplex Interaction
Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…

Cited by 7

Full-Duplex Interaction
Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Interaction & Multimodal
Map of Content for the interaction-multimodal domain — 7 concepts. Curated entry point; see Home for all domains.
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
TML-Interaction-Small
TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…