EEG2Video: Towards Decoding Dynamic Visual Perception from EEG Signals

Overview

EEG2Video presents a framework for reconstructing dynamic visual content (videos) from brain activity recorded through EEG (Electroencephalography). The work focuses on understanding and decoding what a person sees by analyzing their neural signals.


Key Concepts and Examples

1. EEG Signals

EEG captures electrical activity from the brain using electrodes placed on the scalp.

Example: A participant watches a video of flowing water while their brain’s electrical activity is recorded.


2. Visual Stimuli

The system uses dynamic video clips, rather than static images, to explore how the brain processes continuous visual scenes.

Example: Instead of seeing a single picture of a bird, participants view a short video of birds flying across the sky.


3. Decoding Visual Perception

The core of EEG2Video is translating brain signals into video frames, effectively reconstructing an approximate version of what the person saw.

Example: Based on EEG patterns when a person watches a waterfall, the model generates video frames resembling flowing water.


Why This Problem Matters

  • Helps decode visual perception using non-invasive EEG.
  • Opens avenues for assistive tech and neuroscience research.
  • Offers a high-temporal-resolution alternative to fMRI.

Research Questions

  1. Can we decode dynamic visual perception from EEG signals?
  2. Is it possible to reconstruct temporally continuous visual experiences from EEG?
  3. How well do EEG signals preserve visual content in time?

Contributions

  • First end-to-end framework for reconstructing video from EEG.
  • Proposes GLMNet (Global-Local Mixture Network) for EEG encoding.
  • Leverages Transformer and diffusion models for high-fidelity video synthesis.

Methodology: Step-by-Step with Example

Step 1: Raw EEG Input

  • EEG: 32 channels × 256 samples (2 seconds sampled at 128 Hz)
EEG = [
  [0.12, -0.08, ..., 0.09],  # Channel 1
  ...
  [0.05, -0.04, ..., 0.02]   # Channel 32
]
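
For concreteness, the same input can be built as a NumPy array; this is a minimal sketch with random values standing in for a real recording (variable names are illustrative):

import numpy as np

n_channels, n_samples, fs = 32, 256, 128          # 32 channels, 2 s at 128 Hz
rng = np.random.default_rng(0)
EEG = rng.standard_normal((n_channels, n_samples)).astype(np.float32)  # placeholder for real EEG
print(EEG.shape)  # (32, 256)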

Step 2: Sliding Window

  • Window size = 64 samples (0.5 s)
  • Stride = 32 samples (0.25 s)
  • Windows: x1 to x7, each shape: (32, 64)
x1 = EEG[:, 0:64]
x2 = EEG[:, 32:96]
...
x7 = EEG[:, 192:256]
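
The slicing above can be written as a single loop; a minimal sketch assuming the EEG array from Step 1:

window, stride = 64, 32
windows = [EEG[:, s:s + window] for s in range(0, EEG.shape[1] - window + 1, stride)]
print(len(windows), windows[0].shape)  # 7 (32, 64)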

Step 3: EEG Encoder (GLMNet)

  • E_global(xi) from all channels
  • E_local(xi) from occipital channels
  • Output: ei ∈ ℝ^512
Sequence: E = [e1, e2, ..., e7]  # Shape: (7, 512)
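
The paper's exact GLMNet architecture is not reproduced here; the sketch below is only an illustrative PyTorch mixture of a global branch over all channels and a local branch over the occipital channels, with hypothetical layer sizes and channel indices:

import torch
import torch.nn as nn

class GLMNetSketch(nn.Module):
    # Illustrative global-local mixture encoder, not the paper's exact design.
    def __init__(self, n_channels=32, n_occipital=6, window=64, dim=512):
        super().__init__()
        # Assumed: occipital electrodes are the last 6 channels of the montage.
        self.occipital_idx = list(range(n_channels - n_occipital, n_channels))
        self.global_branch = nn.Sequential(nn.Flatten(), nn.Linear(n_channels * window, dim), nn.GELU())
        self.local_branch = nn.Sequential(nn.Flatten(), nn.Linear(n_occipital * window, dim), nn.GELU())
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, x):                                   # x: (batch, 32, 64)
        g = self.global_branch(x)                           # E_global from all channels
        l = self.local_branch(x[:, self.occipital_idx, :])  # E_local from occipital channels
        return self.mix(torch.cat([g, l], dim=-1))          # e_i in R^512

encoder = GLMNetSketch()
e1 = encoder(torch.randn(1, 32, 64))
print(e1.shape)  # torch.Size([1, 512])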

Step 4: Seq2Seq Transformer

  • Input: ei + PE[i]
  • Output: ẑ0 ∈ ℝ^(7 × 768) (latent codes of predicted video frames)
  • Ground truth: video frames → VAE encoder → z0 ∈ ℝ^(7 × 768)
  • Loss: MSE(ẑ0, z0)
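
A minimal sketch of this temporal model, assuming a standard PyTorch TransformerEncoder with a learned positional embedding and a linear head into the 768-dimensional latent space (layer and head counts are illustrative):

import torch
import torch.nn as nn

class Seq2SeqSketch(nn.Module):
    def __init__(self, in_dim=512, out_dim=768, seq_len=7, n_layers=2, n_heads=8):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(seq_len, in_dim))   # learned PE[i]
        layer = nn.TransformerEncoderLayer(d_model=in_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(in_dim, out_dim)

    def forward(self, E):                  # E: (batch, 7, 512)
        h = self.encoder(E + self.pos)     # temporal modeling over the 7 window embeddings
        return self.head(h)                # ẑ0: (batch, 7, 768)

model = Seq2SeqSketch()
z_hat = model(torch.randn(1, 7, 512))      # predicted latents ẑ0
z0 = torch.randn(1, 7, 768)                # stand-in for VAE latents of the ground-truth frames
loss = nn.functional.mse_loss(z_hat, z0)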

Step 5: Semantic Alignment

  1. Caption from BLIP → “A bird flying.”
  2. CLIP text embedding: et ∈ ℝ^(77 × 768)
  3. Project EEG via MLP: ê_t
  4. Loss: MSE(ê_t, et)
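
A sketch of the alignment objective, assuming a simple MLP that maps a pooled EEG feature to the CLIP text-embedding shape of (77, 768); the caption embedding itself is shown as a placeholder tensor rather than a real CLIP output:

import torch
import torch.nn as nn

proj = nn.Sequential(nn.Linear(512, 1024), nn.GELU(), nn.Linear(1024, 77 * 768))  # illustrative MLP

eeg_feat = torch.randn(1, 512)                 # pooled EEG feature
e_t_hat = proj(eeg_feat).view(1, 77, 768)      # ê_t, projected EEG embedding
e_t = torch.randn(1, 77, 768)                  # placeholder for the CLIP embedding of the BLIP caption
loss = nn.functional.mse_loss(e_t_hat, e_t)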

Step 6: Dynamic-Aware Diffusion

zT = √αT · ẑ0 + √(1 − αT) · (√β · ϵs + √(1 − β) · ϵd)
  • β = 0.2 for dynamic clips, 0.3 for static clips
  • ϵs is a single noise vector shared across all frames (static component); ϵd is sampled independently for each frame (dynamic component)
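
The noise-mixing step follows directly from the formula above; in this sketch ϵs is broadcast across the 7 frames while ϵd differs per frame, and the value of αT is illustrative:

import torch

alpha_T = 0.01                               # illustrative cumulative noise-schedule value at step T
beta = 0.2                                   # 0.2 for dynamic clips, 0.3 for static clips (per the text above)

z0_hat = torch.randn(7, 768)                 # predicted frame latents from the Seq2Seq Transformer
eps_s = torch.randn(1, 768).expand(7, 768)   # shared noise, identical for every frame
eps_d = torch.randn(7, 768)                  # per-frame noise

eps = beta ** 0.5 * eps_s + (1 - beta) ** 0.5 * eps_d
z_T = alpha_T ** 0.5 * z0_hat + (1 - alpha_T) ** 0.5 * eps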

Step 7: Tune-A-Video Frame Synthesis

Q = W_Q · z_vi,
K = W_K · [z_v(i−1), z_v1],
V = W_V · [z_v(i−1), z_v1]
  • Attention across current, first, and previous frames
  • Ensures temporal consistency
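
A sketch of this cross-frame attention pattern, using single-head scaled dot-product attention and illustrative dimensions; z_i, z_prev, and z_first stand for the latents of the current, previous, and first frames:

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 768
W_Q, W_K, W_V = (nn.Linear(d, d, bias=False) for _ in range(3))

z_i = torch.randn(1, 1, d)                       # current frame latent
z_prev = torch.randn(1, 1, d)                    # previous frame latent
z_first = torch.randn(1, 1, d)                   # first frame latent

Q = W_Q(z_i)                                     # queries from the current frame
K = W_K(torch.cat([z_prev, z_first], dim=1))     # keys from the previous and first frames
V = W_V(torch.cat([z_prev, z_first], dim=1))     # values from the previous and first frames

attn = F.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)
out = attn @ V                                   # temporally consistent features for the current frame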

Full Pipeline Summary

| Step | Input Shape | Output Shape | Description |
|------|-------------|--------------|-------------|
| Raw EEG | – | (32, 256) | 2 s EEG signal |
| Sliding Windows | (32, 256) | 7 × (32, 64) | Temporal slicing |
| EEG Encoder (GLMNet) | (32, 64) per window | (512,) per window | Mixed global/local features |
| Seq2Seq Transformer | (7, 512) | (7, 768) | Temporal modeling |
| VAE Latents | Video frames | (7, 768) | Ground-truth latents z0 used as supervision |
| Semantic Alignment | (512,) | (77, 768) | EEG projected to caption embedding |
| Diffusion + Noise | (7, 768) | (7, 768) | Noisy latents ready for generation |
| Frame Decoder | (7, 768) | (7, 64, 64, 3) | Final RGB video |
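
As a rough end-to-end shape check of the table above, the stages can be chained with simple linear stand-ins for the encoder and transformer (illustrative only, not the released implementation):

import torch
import torch.nn as nn

eeg_encoder = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(32 * 64, 512))  # stand-in for GLMNet
seq2seq = nn.Linear(512, 768)                                                  # stand-in for the Transformer

eeg = torch.randn(32, 256)                                                     # raw 2 s EEG
windows = torch.stack([eeg[:, s:s + 64] for s in range(0, 256 - 64 + 1, 32)])  # (7, 32, 64)
E = eeg_encoder(windows)                                                        # (7, 512)
z_hat = seq2seq(E)                                                               # (7, 768)
print(windows.shape, E.shape, z_hat.shape)
# Dynamic-aware noise is then added to z_hat, the video diffusion model denoises it,
# and the VAE decoder maps the result to (7, 64, 64, 3) RGB frames.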

📓 Jupyter Notebook

You can download and run the full EEG2Video demo notebook here:

Download EEG2Video_Jupyter.ipynb


Conclusion

EEG2Video bridges EEG decoding and generative video synthesis. With high temporal resolution, a global-local encoder, and advanced generative tools, it pushes the frontier of neural decoding from non-invasive data.