EEG2Video: Towards Decoding Dynamic Visual Perception from EEG Signals

Overview

EEG2Video presents a framework for reconstructing dynamic visual content (videos) from brain activity recorded through EEG (Electroencephalography). The work focuses on understanding and decoding what a person sees by analyzing their neural signals.


Key Concepts and Examples

1. EEG Signals

EEG captures electrical activity from the brain using electrodes placed on the scalp.

Example: A participant watches a video of flowing water while their brain’s electrical activity is recorded.


2. Visual Stimuli

The system uses dynamic video clips, rather than static images, to explore how the brain processes continuous visual scenes.

Example: Instead of seeing a single picture of a bird, participants view a short video of birds flying across the sky.


3. Decoding Visual Perception

The core of EEG2Video is translating brain signals into video frames, effectively reconstructing an approximate version of what the person saw.

Example: Based on EEG patterns when a person watches a waterfall, the model generates video frames resembling flowing water.


Why This Problem Matters

  • Helps decode visual perception using non-invasive EEG.
  • Opens avenues for assistive tech and neuroscience research.
  • Offers a high-temporal-resolution alternative to fMRI.

Research Questions

  1. Can we decode dynamic visual perception from EEG signals?
  2. Is it possible to reconstruct temporally continuous visual experiences from EEG?
  3. How well do EEG signals preserve visual content in time?

Contributions

  • First end-to-end framework for reconstructing video from EEG.
  • Proposes GLMNet (Global-Local Mixture Network) for EEG encoding.
  • Leverages Transformer and diffusion models for high-fidelity video synthesis.

Methodology: Step-by-Step with Example

Step 1: Raw EEG Input

  • EEG: 32 channels × 256 samples (2 seconds sampled at 128 Hz)
EEG = [
  [0.12, -0.08, ..., 0.09],  # Channel 1
  ...
  [0.05, -0.04, ..., 0.02]   # Channel 32
]
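
For concreteness, the same input can be built as a NumPy array; this is a minimal sketch with random values standing in for a real recording (variable names are illustrative):

import numpy as np

n_channels, n_samples, fs = 32, 256, 128          # 32 channels, 2 s at 128 Hz
rng = np.random.default_rng(0)
EEG = rng.standard_normal((n_channels, n_samples)).astype(np.float32)  # placeholder for real EEG
print(EEG.shape)  # (32, 256)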

Step 2: Sliding Window

  • Window size = 64 samples (0.5 s)
  • Stride = 32 samples (0.25 s)
  • Windows: x1 to x7, each shape: (32, 64)
x1 = EEG[:, 0:64]
x2 = EEG[:, 32:96]
...
x7 = EEG[:, 192:256]
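
The slicing above can be written as a single loop; a minimal sketch assuming the EEG array from Step 1:

window, stride = 64, 32
windows = [EEG[:, s:s + window] for s in range(0, EEG.shape[1] - window + 1, stride)]
print(len(windows), windows[0].shape)  # 7 (32, 64)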

Step 3: EEG Encoder (GLMNet)

  • E_global(xi) from all channels
  • E_local(xi) from occipital channels
  • Output: ei ∈ ℝ^512
Sequence: E = [e1, e2, ..., e7]  # Shape: (7, 512)
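
The paper's exact GLMNet architecture is not reproduced here; the sketch below is only an illustrative PyTorch mixture of a global branch over all channels and a local branch over the occipital channels, with hypothetical layer sizes and channel indices:

import torch
import torch.nn as nn

class GLMNetSketch(nn.Module):
    # Illustrative global-local mixture encoder, not the paper's exact design.
    def __init__(self, n_channels=32, n_occipital=6, window=64, dim=512):
        super().__init__()
        # Assumed: occipital electrodes are the last 6 channels of the montage.
        self.occipital_idx = list(range(n_channels - n_occipital, n_channels))
        self.global_branch = nn.Sequential(nn.Flatten(), nn.Linear(n_channels * window, dim), nn.GELU())
        self.local_branch = nn.Sequential(nn.Flatten(), nn.Linear(n_occipital * window, dim), nn.GELU())
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, x):                                   # x: (batch, 32, 64)
        g = self.global_branch(x)                           # E_global from all channels
        l = self.local_branch(x[:, self.occipital_idx, :])  # E_local from occipital channels
        return self.mix(torch.cat([g, l], dim=-1))          # e_i in R^512

encoder = GLMNetSketch()
e1 = encoder(torch.randn(1, 32, 64))
print(e1.shape)  # torch.Size([1, 512])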

Step 4: Seq2Seq Transformer

  • Input: ei + PE[i]
  • Output: ẑ0 ∈ ℝ^(7 × 768) (latent codes of predicted video frames)
  • Ground truth: video frames → VAE encoder → z0 ∈ ℝ^(7 × 768)
  • Loss: MSE(ẑ0, z0)
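
A minimal sketch of this temporal model, assuming a standard PyTorch TransformerEncoder with a learned positional embedding and a linear head into the 768-dimensional latent space (layer and head counts are illustrative):

import torch
import torch.nn as nn

class Seq2SeqSketch(nn.Module):
    def __init__(self, in_dim=512, out_dim=768, seq_len=7, n_layers=2, n_heads=8):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(seq_len, in_dim))   # learned PE[i]
        layer = nn.TransformerEncoderLayer(d_model=in_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(in_dim, out_dim)

    def forward(self, E):                  # E: (batch, 7, 512)
        h = self.encoder(E + self.pos)     # temporal modeling over the 7 window embeddings
        return self.head(h)                # ẑ0: (batch, 7, 768)

model = Seq2SeqSketch()
z_hat = model(torch.randn(1, 7, 512))      # predicted latents ẑ0
z0 = torch.randn(1, 7, 768)                # stand-in for VAE latents of the ground-truth frames
loss = nn.functional.mse_loss(z_hat, z0)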

Step 5: Semantic Alignment

  1. Caption from BLIP → “A bird flying.”
  2. CLIP text embedding: et ∈ ℝ^(77 × 768)
  3. Project EEG via MLP: ê_t
  4. Loss: MSE(ê_t, et)
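
A sketch of the alignment objective, assuming a simple MLP that maps a pooled EEG feature to the CLIP text-embedding shape of (77, 768); the caption embedding itself is shown as a placeholder tensor rather than a real CLIP output:

import torch
import torch.nn as nn

proj = nn.Sequential(nn.Linear(512, 1024), nn.GELU(), nn.Linear(1024, 77 * 768))  # illustrative MLP

eeg_feat = torch.randn(1, 512)                 # pooled EEG feature
e_t_hat = proj(eeg_feat).view(1, 77, 768)      # ê_t, projected EEG embedding
e_t = torch.randn(1, 77, 768)                  # placeholder for the CLIP embedding of the BLIP caption
loss = nn.functional.mse_loss(e_t_hat, e_t)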

Step 6: Dynamic-Aware Diffusion

zT = √αT · ẑ0 + √(1 − αT) · (√β · ϵs + √(1 − β) · ϵd)
  • β = 0.2 for dynamic clips, 0.3 for static clips
  • ϵs is a single noise vector shared across all frames (static component); ϵd is sampled independently for each frame (dynamic component)
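
The noise-mixing step follows directly from the formula above; in this sketch ϵs is broadcast across the 7 frames while ϵd differs per frame, and the value of αT is illustrative:

import torch

alpha_T = 0.01                               # illustrative cumulative noise-schedule value at step T
beta = 0.2                                   # 0.2 for dynamic clips, 0.3 for static clips (per the text above)

z0_hat = torch.randn(7, 768)                 # predicted frame latents from the Seq2Seq Transformer
eps_s = torch.randn(1, 768).expand(7, 768)   # shared noise, identical for every frame
eps_d = torch.randn(7, 768)                  # per-frame noise

eps = beta ** 0.5 * eps_s + (1 - beta) ** 0.5 * eps_d
z_T = alpha_T ** 0.5 * z0_hat + (1 - alpha_T) ** 0.5 * eps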

Step 7: Tune-A-Video Frame Synthesis

Q = W_Q · z_vi,
K = W_K · [z_v(i−1), z_v1],
V = W_V · [z_v(i−1), z_v1]
  • Attention across current, first, and previous frames
  • Ensures temporal consistency
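
A sketch of this cross-frame attention pattern, using single-head scaled dot-product attention and illustrative dimensions; z_i, z_prev, and z_first stand for the latents of the current, previous, and first frames:

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 768
W_Q, W_K, W_V = (nn.Linear(d, d, bias=False) for _ in range(3))

z_i = torch.randn(1, 1, d)                       # current frame latent
z_prev = torch.randn(1, 1, d)                    # previous frame latent
z_first = torch.randn(1, 1, d)                   # first frame latent

Q = W_Q(z_i)                                     # queries from the current frame
K = W_K(torch.cat([z_prev, z_first], dim=1))     # keys from the previous and first frames
V = W_V(torch.cat([z_prev, z_first], dim=1))     # values from the previous and first frames

attn = F.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)
out = attn @ V                                   # temporally consistent features for the current frame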

Full Pipeline Summary

| Step | Input Shape | Output Shape | Description |
|------|-------------|--------------|-------------|
| Raw EEG | – | (32, 256) | 2 s EEG signal |
| Sliding Windows | (32, 256) | 7 × (32, 64) | Temporal slicing |
| EEG Encoder (GLMNet) | (32, 64) per window | (512,) per window | Mixed global/local features |
| Seq2Seq Transformer | (7, 512) | (7, 768) | Temporal modeling |
| VAE Latents | Video frames | (7, 768) | Ground-truth latents z0 used as supervision |
| Semantic Alignment | (512,) | (77, 768) | EEG projected to caption embedding |
| Diffusion + Noise | (7, 768) | (7, 768) | Noisy latents ready for generation |
| Frame Decoder | (7, 768) | (7, 64, 64, 3) | Final RGB video |
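
As a rough end-to-end shape check of the table above, the stages can be chained with simple linear stand-ins for the encoder and transformer (illustrative only, not the released implementation):

import torch
import torch.nn as nn

eeg_encoder = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(32 * 64, 512))  # stand-in for GLMNet
seq2seq = nn.Linear(512, 768)                                                  # stand-in for the Transformer

eeg = torch.randn(32, 256)                                                     # raw 2 s EEG
windows = torch.stack([eeg[:, s:s + 64] for s in range(0, 256 - 64 + 1, 32)])  # (7, 32, 64)
E = eeg_encoder(windows)                                                        # (7, 512)
z_hat = seq2seq(E)                                                               # (7, 768)
print(windows.shape, E.shape, z_hat.shape)
# Dynamic-aware noise is then added to z_hat, the video diffusion model denoises it,
# and the VAE decoder maps the result to (7, 64, 64, 3) RGB frames.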

📓 Jupyter Notebook

You can download and run the full EEG2Video demo notebook here:

Download EEG2Video_Jupyter.ipynb


Conclusion

EEG2Video bridges EEG decoding and generative video synthesis. With high temporal resolution, a global-local encoder, and advanced generative tools, it pushes the frontier of neural decoding from non-invasive data.