Howardism

Encoder-Free Early Fusion

Published: May 13, 2026 · Filed: Concept · Tags: LLM Architecture, Multimodal · Reading: 3 min · Source: AI-synthesised

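A multimodal design that replaces large standalone encoders with minimal preprocessing: dMel audio embeddings, a 40×40-patch hMLP for image frames, and a flow head for audio output, all co-trained from scratch inside a single transformer.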

[Figure: Encoder-Free Early Fusion concept diagram]

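Sources#

Summary#

A multimodal design choice in Interaction Models: rather than routing audio and video through large standalone encoders (and emitting audio through a separate TTS-style decoder), the model uses minimal preprocessing and a single transformer, with every component co-trained from scratch. "Encoder-free" is a relative term: lightweight embedding layers remain, but there is no Whisper-scale audio encoder and no separate TTS model.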

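Components (per 200 ms micro-turn)#

Inputs (any subset of text / image frames / audio):

  • Text → token embeddings (standard practice).
  • Image / video frames → split into 40×40 patches and encoded by an hMLP (Touvron et al. 2022).
  • Audio → represented as dMel (Bai et al. 2024) and passed through a lightweight embedding layer (a "bag of embeddings").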
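The image and audio input paths can be sketched roughly as follows. This is a minimal illustration, not the model's actual preprocessing. It assumes that dMel quantizes each log-mel channel into a small number of intensity bins and that "bag of embeddings" means summing one learned vector per (channel, bin) pair. The function names, the value range `lo`/`hi`, and `n_bins` are all hypothetical. The hMLP that would turn each flattened patch into an embedding is left out.

```python
PATCH = 40  # patch side length, taken from the text


def patchify(frame, patch=PATCH):
    """Split an H x W frame (a list of rows) into flattened patches, row-major."""
    h, w = len(frame), len(frame[0])
    if h % patch or w % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([frame[y][x]
                            for y in range(top, top + patch)
                            for x in range(left, left + patch)])
    return patches


def dmel_quantize(log_mel_frame, lo, hi, n_bins):
    """Map each log-mel channel value to an integer bin in [0, n_bins).

    Assumption: uniform bins over [lo, hi]; out-of-range values are clamped.
    """
    width = (hi - lo) / n_bins
    return [min(n_bins - 1, max(0, int((v - lo) / width))) for v in log_mel_frame]


def bag_of_embeddings(bins, table):
    """Sum one embedding per (channel, bin) pair into a single frame vector.

    table[channel][bin] is a (hypothetical) learned vector of the model width.
    """
    dim = len(table[0][0])
    out = [0.0] * dim
    for channel, b in enumerate(bins):
        for i, value in enumerate(table[channel][b]):
            out[i] += value
    return out


frame = [[0.0] * 120 for _ in range(80)]   # 80 x 120 frame -> 2 x 3 grid of patches
patches = patchify(frame)
print(len(patches), len(patches[0]))        # 6 1600
```

Under these assumptions each audio frame becomes one vector and each 40×40 patch becomes one vector after the hMLP, so both modalities arrive at the transformer as ordinary token-width embeddings.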

A single shared transformer consumes the fused inputs.
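One way to picture the fusion step is plain concatenation of whichever modality embeddings are present in a micro-turn. The sketch below assumes exactly that. The source does not say how tokens are ordered or tagged, so the ordering, the modality labels, and the name `fuse_micro_turn` are illustrative only.

```python
def fuse_micro_turn(text=None, image=None, audio=None):
    """Concatenate the embeddings of the modalities that are present.

    Each argument is a list of equal-width vectors, or None when that
    modality is absent from this micro-turn. Returns a list of
    (modality_name, vector) pairs forming one input sequence.
    """
    sequence = []
    for name, embeddings in (("text", text), ("image", image), ("audio", audio)):
        for vector in embeddings or []:
            sequence.append((name, vector))
    widths = {len(vector) for _, vector in sequence}
    if len(widths) > 1:
        raise ValueError("all modalities must share the model width")
    return sequence


# A micro-turn with one text token and two audio frames, no image.
turn = fuse_micro_turn(text=[[0.0, 1.0]], audio=[[1.0, 1.0], [2.0, 2.0]])
print([name for name, _ in turn])   # ['text', 'audio', 'audio']
```

The width check reflects the one hard requirement of early fusion: every embedding layer, however light, has to project into the same vector space the transformer operates in.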

Outputs:

  • Text → unembedding (standard practice).
  • Audio → mel frames generated by a flow head (Lipman et al. 2022).
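The flow head can be illustrated with the basic flow-matching recipe from Lipman et al. 2022: train a velocity field on straight-line paths between noise and data, then generate by integrating that field from noise. The sketch below assumes the simplest variant (linear path, zero minimum noise) and a fixed-step Euler solver. How the real head is conditioned on the transformer's hidden state is not described in the source, so `velocity` is a hypothetical callable standing in for it.

```python
def flow_matching_target(x0, x1, t):
    """Return the point on the straight path from noise x0 to data x1 at
    time t, and the regression target for the velocity at that point."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def euler_sample(velocity, x0, n_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps.

    velocity: callable (x, t) -> list of floats; in a real model this would
    be the learned flow head conditioned on the transformer output.
    """
    x = list(x0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        v = velocity(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an oracle field that always points from the noise to one target.
noise, mel = [0.0, 0.0], [2.0, 4.0]
oracle = lambda x, t: [b - a for a, b in zip(noise, mel)]
print(euler_sample(oracle, noise))
```

With the oracle field the sampler lands on the target vector, up to floating-point rounding. A learned field only approximates this, and the number of Euler steps trades output quality against per-turn latency.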

All components are co-trained from scratch together with the transformer, rather than stitched together from pretrained encoders and decoders.

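Why it matters#

  • Avoids the latency and complexity of large standalone encoders/decoders, which matters most when the model has to run every 200 ms (see Time-Aligned Micro-Turns).
  • Early fusion (all modalities enter the same transformer) lets the model reason jointly across modalities instead of reasoning only over outputs pre-digested by an encoder. This is a prerequisite for capabilities such as reacting to visual cues while speaking (see Full-Duplex Interaction).
  • Co-training from scratch follows the spirit of The Bitter Lesson: fewer hand-designed module boundaries, more end-to-end learning.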
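The latency point in the first bullet can be made concrete with some back-of-the-envelope arithmetic. Only the 200 ms turn length and the 40×40 patch size come from the text. The frame resolution, the audio hop size, and the assumption of one token per patch and per audio frame are hypothetical values chosen for illustration.

```python
def tokens_per_micro_turn(frame_h, frame_w, turn_ms=200, hop_ms=10,
                          patch=40, frames_per_turn=1):
    """Rough count of non-text input tokens in one micro-turn.

    Assumes one token per image patch and one token per audio frame;
    hop_ms and frames_per_turn are illustrative, not from the source.
    """
    image_patches = (frame_h // patch) * (frame_w // patch) * frames_per_turn
    audio_frames = turn_ms // hop_ms
    return {
        "image_patches": image_patches,
        "audio_frames": audio_frames,
        "total": image_patches + audio_frames,
        "passes_per_second": 1000 / turn_ms,
    }


print(tokens_per_micro_turn(240, 320))
# {'image_patches': 48, 'audio_frames': 20, 'total': 68, 'passes_per_second': 5.0}
```

At five forward passes per second, any fixed cost added in front of the transformer is paid five times every second, which is why a large standalone encoder on this path is expensive.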

Related links#


About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
