diff --git a/proposal/dreamer-basic-arch.png b/proposal/dreamer-basic-arch.png
new file mode 100644
index 0000000..f053420
Binary files /dev/null and b/proposal/dreamer-basic-arch.png differ
diff --git a/proposal/index.html b/proposal/index.html
index 14b2786..b15f051 100644
--- a/proposal/index.html
+++ b/proposal/index.html
@@ -189,6 +189,54 @@
         );

+        const Motivation = () => (
+            <section>
+                <h2>The Bottleneck in Standard Dreamer</h2>
+
+                <figure>
+                    <img src="dreamer-basic-arch.png" alt="Dreamer Architecture Comparison" />
+                    <figcaption>
+                        Figure: (a) the World Model Learning phase requires training the
+                        encoder/decoder for reconstruction; (b) Actor-Critic Learning can only
+                        begin after the world model converges.
+                    </figcaption>
+                </figure>
+
+                <div>
+                    <h3>⚠️ The Problem</h3>
+                    <p>
+                        Dreamer-V3 must first train its Encoder and Decoder networks to
+                        accurately reconstruct pixel-level observations. This reconstruction
+                        objective delays the actual Actor-Critic training, requiring millions of
+                        environment steps before the world model produces useful latent
+                        representations. The decoder alone adds substantial computational
+                        overhead while modeling irrelevant visual details.
+                    </p>
+                </div>
+
+                <div>
+                    <h3>The V-JEPA Solution</h3>
+                    <p>
+                        By replacing the trainable encoder with a frozen V-JEPA backbone, we
+                        eliminate the need for pixel reconstruction entirely. This dramatically
+                        reduces the number of trainable parameters (no encoder training, no
+                        decoder needed), saving compute while potentially improving
+                        generalization thanks to V-JEPA's pretraining on millions of diverse
+                        videos. The agent can immediately leverage "adult-level" visual
+                        understanding.
+                    </p>
+                </div>
+            </section>
+        );
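For concreteness, the encoder swap described above amounts to roughly the following. This is a minimal PyTorch-style sketch under stated assumptions, not code from this PR: the pretrained V-JEPA backbone is assumed to be available as an `nn.Module` that returns one pooled embedding per timestep, and `FrozenVJEPAEncoder` is an illustrative name.

```python
# Minimal sketch (assumption-laden, not this PR's implementation):
# the pretrained V-JEPA backbone is treated as a fixed feature extractor.
import torch
import torch.nn as nn

class FrozenVJEPAEncoder(nn.Module):
    """Wraps a pretrained V-JEPA backbone; nothing here is trainable."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.backbone.requires_grad_(False)  # no encoder gradients...
        self.backbone.eval()                 # ...and no pixel decoder exists at all

    @torch.no_grad()
    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        # assumed output: one embedding per timestep, (batch, time, embed_dim)
        return self.backbone(frames)
```

The world model then trains purely on prediction in this latent space (plus reward and continuation heads), so the reconstruction loss and its decoder disappear from the objective.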
+
         const ArchitectureViewer = () => {
             const [mode, setMode] = useState('jepa'); // 'standard' or 'jepa'
@@ -345,22 +393,55 @@
         const Challenges = () => (
-            <div>
-                <h3>Critical Challenge: The "Red Light" Problem</h3>
-                <p>
-                    A significant risk of using a completely frozen encoder is the potential filtering of tiny, task-relevant details. A small red light might be statistically insignificant in general internet video data (V-JEPA's training set) but critical for a specific RL task (e.g., a braking signal).
-                </p>
-
-                <h4>Proposed Solution: Trainable Adapters</h4>
-                <p>
-                    To mitigate this, we insert lightweight Trainable Adapters (Low-Rank Adaptation or similar) into the JEPA backbone. This allows the RL signal to tune attention toward task-specific features without destroying the pretrained general knowledge, maintaining the "adult-level" visual processing while allowing for task specialization.
-                </p>
-            </div>

Critical Challenges & Risks

+ +
+ {/* Challenge 1: Red Light Problem */} +
+
+
+ +
+
+

Challenge 1: The "Red Light" Problem

+

+ A significant risk of using a completely frozen encoder is the potential filtering of tiny, task-relevant details. A small red light might be statistically insignificant in general internet video data (V-JEPA's training set) but critical for a specific RL task (e.g., a braking signal).

+ +
+

Proposed Solution: Trainable Adapters

+

+ To mitigate this, we insert lightweight Trainable Adapters (Low-Rank Adaptation or similar) into the JEPA backbone. This allows the RL signal to tune attention toward task-specific features without destroying the pretrained general knowledge, maintaining the "adult-level" visual processing while allowing for task specialization. +

+
+
+
+
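As a concrete illustration of the adapter idea, below is a hedged LoRA-style sketch around a single frozen linear layer of the backbone; `LoRALinear` and the rank/alpha defaults are illustrative choices, not settings fixed by this proposal.

```python
# Illustrative LoRA-style adapter (assumed hyperparameters, hypothetical name).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Adds a trainable low-rank path beside a frozen pretrained linear layer."""

    def __init__(self, frozen: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.frozen = frozen
        self.frozen.requires_grad_(False)    # pretrained knowledge stays intact
        self.down = nn.Linear(frozen.in_features, rank, bias=False)
        self.up = nn.Linear(rank, frozen.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path + small trainable correction driven by the RL gradient.
        return self.frozen(x) + self.scale * self.up(self.down(x))
```

Because the up-projection is zero-initialized, the wrapped layer initially behaves exactly like the frozen original; the RL signal then only has to learn a small correction, e.g., re-weighting attention toward a brake light.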
+            {/* Challenge 2: Validation Problem */}
+            <div>
+                <h3>Challenge 2: The Validation Problem</h3>
+                <p>
+                    Without a decoder to reconstruct pixels, it becomes significantly harder to
+                    validate that the hidden state actually represents the world state. In
+                    standard Dreamer, poor reconstruction quality is a clear diagnostic that
+                    something is wrong with the latent representations; removing this feedback
+                    loop makes debugging and verification more challenging.
+                </p>
+
+                <h4>Proposed Solution: Alternative Validation Methods</h4>
+                <p>
+                    We propose proxy validation metrics to ensure representation quality:
+                </p>
+                <ul>
+                    <li>Latent prediction accuracy: measure how well future latent states are predicted in V-JEPA space</li>
+                    <li>Downstream task performance: monitor RL reward signals and convergence speed as indirect validation</li>
+                    <li>Probing classifiers: train lightweight probes to predict known world properties (object positions, states) from latents</li>
+                    <li>Optional sparse decoding: periodically reconstruct a small batch of frames for qualitative inspection</li>
+                </ul>
+            </div>
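Two of the proxy checks above are simple enough to sketch. The snippet below is illustrative only; `latent_prediction_error` and `LinearProbe` are hypothetical names, and the tensors are assumed to come from the agent's replay buffer.

```python
# Illustrative proxy-validation sketches (hypothetical names and shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

def latent_prediction_error(pred_latents: torch.Tensor,
                            true_latents: torch.Tensor) -> torch.Tensor:
    # Cosine distance between the world model's predicted next latents and the
    # frozen V-JEPA embeddings of the frames actually observed next.
    return 1.0 - F.cosine_similarity(pred_latents, true_latents, dim=-1).mean()

class LinearProbe(nn.Module):
    """Lightweight probe mapping latents to known world properties."""

    def __init__(self, embed_dim: int, num_properties: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_properties)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # Detach so probing can never alter the representations it audits.
        return self.head(latents.detach())
```

Low probe error and stable latent-prediction error together stand in for the reconstruction-quality signal that the removed decoder used to provide.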
@@ -389,6 +470,7 @@