+ + Dreamer-V3 must first train its Encoder and Decoder networks + to accurately reconstruct pixel-level observations. This reconstruction objective delays the + actual Actor-Critic training, requiring millions of environment steps before + the world model produces useful latent representations. The decoder alone adds substantial + computational overhead while modeling irrelevant visual details. +
++ By replacing the trainable encoder with a frozen V-JEPA backbone, we eliminate + the need for pixel reconstruction entirely. This dramatically reduces trainable parameters + (no encoder training, no decoder needed), saving compute while potentially increasing + generalization due to V-JEPA's pretraining on millions of diverse videos. The agent + can immediately leverage "adult-level" visual understanding. +
+- A significant risk of using a completely frozen encoder is the potential filtering of tiny, task-relevant details. A small red light might be statistically insignificant in general internet video data (V-JEPA's training set) but critical for a specific RL task (e.g., a braking signal). -
- -- To mitigate this, we insert lightweight Trainable Adapters (Low-Rank Adaptation or similar) into the JEPA backbone. This allows the RL signal to tune attention toward task-specific features without destroying the pretrained general knowledge, maintaining the "adult-level" visual processing while allowing for task specialization. +
+ A significant risk of using a completely frozen encoder is the potential filtering of tiny, task-relevant details. A small red light might be statistically insignificant in general internet video data (V-JEPA's training set) but critical for a specific RL task (e.g., a braking signal).
+ ++ To mitigate this, we insert lightweight Trainable Adapters (Low-Rank Adaptation or similar) into the JEPA backbone. This allows the RL signal to tune attention toward task-specific features without destroying the pretrained general knowledge, maintaining the "adult-level" visual processing while allowing for task specialization. +
++ Without a decoder to reconstruct pixel representations, it becomes significantly harder to validate that the hidden state actually represents the world state accurately. In standard Dreamer, poor reconstruction quality serves as a clear diagnostic signal that something is wrong with the latent representations. Removing this feedback loop makes debugging and verification more challenging. +
+ ++ We propose using proxy validation metrics to ensure representation quality: +
+