Working with the Neuroprobe Benchmark#
Neuroprobe is a standardized benchmark for evaluating neural decoding models on human intracranial EEG (iEEG) data. It defines 15 tasks spanning audio, language, and vision domains, and supports both binary and multiclass classification label modes. This benchmark is derived from the BrainTreebank dataset — 40 hours of sEEG recordings from 10 human subjects watching naturalistic movies.
For full details, see the Neuroprobe paper and the updated OpenReview submission.
Preparing the data#
Download and process the Neuroprobe data using the brainsets CLI:
brainsets prepare neuroprobe_2025
Note
Processing includes downloading raw BrainTreebank data and computing all
benchmark splits. This may take several hours depending on your hardware and
network connection. Use --cores <N> to parallelize.
Key concepts#
Tasks. Each task uses a 1-second window of neural data aligned to a word onset and can be evaluated in either binary or multiclass label mode. Available tasks:
Auditory tasks#
Task |
Binary labels |
Multiclass labels |
|---|---|---|
volume |
0 - low
1 - high
|
0 - low (<25th percentile)
1 - medium (37.5th-62.5th)
2 - high (>=75th percentile)
|
pitch |
0 - low
1 - high
|
0 - low (<25th percentile)
1 - medium (37.5th-62.5th)
2 - high (>=75th percentile)
|
delta_volume |
0 - low
1 - high
|
0 - low (<25th percentile)
1 - medium (37.5th-62.5th)
2 - high (>=75th percentile)
|
Language tasks#
Task |
Binary labels |
Multiclass labels |
|---|---|---|
|
1 - speech
0 - nonverbal
|
Not used (task remains binary) |
|
1 - sentence-start
0 - nonverbal
|
Not used (task remains binary) |
gpt2_surprisal |
0 - low
1 - high
|
0 - low (<25th percentile)
1 - medium (37.5th-62.5th)
2 - high (>=75th percentile)
|
word_length |
0 - short
1 - long
|
0 - low (<25th percentile)
1 - medium (37.5th-62.5th)
2 - high (>=75th percentile)
|
word_gap |
0 - short
1 - long
|
0 - short gap (<25th percentile)
1 - medium gap (37.5th-62.5th)
2 - long gap (>=75th percentile)
|
word_index |
0 - first word
1 - other word
|
0 - first word
1 - second word
2 - any later word
|
word_head_pos |
1 -
bin_head == 00 -
bin_head == 1 |
Not used (task remains binary) |
|
1 - verb
0 - non-verb
|
0 - noun
1 - verb
2 - pronoun
3 - determiner
4 - adjective
5 - adverb
|
Visual tasks#
Task |
Binary labels |
Multiclass labels |
|---|---|---|
frame_brightness |
0 - low
1 - high
|
0 - low (<25th percentile)
1 - medium (37.5th-62.5th)
2 - high (>=75th percentile)
|
global_flow |
0 - low
1 - high
|
0 - low (<25th percentile)
1 - medium (37.5th-62.5th)
2 - high (>=75th percentile)
|
local_flow |
0 - low
1 - high
|
0 - low (<25th percentile)
1 - medium (37.5th-62.5th)
2 - high (>=75th percentile)
|
face_num |
0 - no faces
1 - one or more
|
0 - no faces
1 - exactly one face
2 - more than one face
|
Label mode. Neuroprobe supports both binary and multiclass classification.
Use label_mode="multiclass" when constructing
Neuroprobe2025 to load multiclass splits.
Regimes (split types). Neuroprobe defines three evaluation regimes that test different levels of generalization:
SS-SM (within-session): Train and test on data from the same subject watching the same movie. Uses 2-fold cross-validation on contiguous blocks to prevent temporal autocorrelation leakage.
SS-DM (cross-session): Train on one movie session and test on a different movie from the same subject.
DS-DM (cross-subject): Train on a fixed anchor recording (Subject 2, Trial 4) and test on a different subject and movie. This is the most challenging regime.
The default leaderboard ranking uses the cross-session (SS-DM) split.
Subset tiers. Three subset sizes control the number of subject/trial pairs and electrodes included:
"full": All eligible subject/trial pairs."lite": A curated subset of 6 subjects with 2 trials each (12 sessions), capped at 120 electrodes per subject. This is the standard benchmark configuration."nano": A single trial per subject for rapid prototyping. Only supports the within-session regime.
Metric. AUROC (Area Under the ROC Curve) remains the primary metric in binary mode. For multiclass evaluations, follow the current Neuroprobe evaluation protocol and leaderboard reporting guidelines.
Loading benchmark splits#
The Neuroprobe2025 class handles split resolution
automatically. Specify the benchmark parameters to get the correct train/test
partition:
from brainsets.datasets import Neuroprobe2025
train_ds = Neuroprobe2025(
subset_tier="lite",
test_subject=1,
test_session=1,
split="train",
label_mode="multiclass",
task="word_part_speech",
regime="SS-DM",
)
test_ds = Neuroprobe2025(
subset_tier="lite",
test_subject=1,
test_session=1,
split="test",
label_mode="multiclass",
task="word_part_speech",
regime="SS-DM",
)
The constructor resolves which recordings to load and which channel subset and time intervals to use based on the requested split.
Within-session (SS-SM) uses 2-fold cross-validation. You can iterate over folds:
from brainsets.datasets import Neuroprobe2025
for fold in range(Neuroprobe2025.num_folds_for_regime("SS-SM")):
train_ds = Neuroprobe2025(
subset_tier="lite",
test_subject=1,
test_session=1,
split="train",
task="speech",
regime="SS-SM",
fold=fold,
)
test_ds = Neuroprobe2025(
subset_tier="lite",
test_subject=1,
test_session=1,
split="test",
task="speech",
regime="SS-SM",
fold=fold,
)
Accessing neural data and labels#
Each recording exposes sEEG data as a RegularTimeSeries
sampled at 2048 Hz, along with split-specific sampling intervals and
channel inclusion masks:
intervals = train_ds.get_sampling_intervals()
for recording_id, interval in intervals.items():
rec = train_ds.get_recording(recording_id)
print(rec.seeg_data.data.shape)
print(interval.start[:5], interval.end[:5])
print(interval.label[:5])
The interval.label array contains class labels for each trial window
(binary or multiclass, depending on label_mode).
Channel metadata (electrode names, coordinates, inclusion masks) is available
via get_channel_metadata():
meta = train_ds.get_channel_metadata(recording_id)
print(meta["names"])
print(meta["coords"]) # LIP coordinates
print(meta["included_mask"]) # benchmark electrode subset
Loading raw recordings#
If you want access to full continuous recordings without benchmark splits
(e.g. for pre-training), pass explicit recording_ids:
from brainsets.datasets import Neuroprobe2025
ds = Neuroprobe2025(recording_ids=["sub_1_trial001", "sub_2_trial004"])
In this mode, no split/task/regime resolution is performed; you get the complete neural data for the requested sessions.
Running a complete benchmark evaluation#
A typical benchmark loop iterates over all tasks, regimes, and subject/trial pairs. Here is a minimal skeleton:
from brainsets.datasets import Neuroprobe2025
from brainsets.datasets.Neuroprobe2025 import (
VALID_TASKS,
NEUROPROBE_LITE_SUBJECT_TRIALS,
)
regime = "SS-DM"
results = {}
for task in VALID_TASKS:
for subject, session in sorted(NEUROPROBE_LITE_SUBJECT_TRIALS):
for fold in range(Neuroprobe2025.num_folds_for_regime(regime)):
train_ds = Neuroprobe2025(
subset_tier="lite",
test_subject=subject,
test_session=session,
split="train",
task=task,
regime=regime,
fold=fold,
)
test_ds = Neuroprobe2025(
subset_tier="lite",
test_subject=subject,
test_session=session,
split="test",
task=task,
regime=regime,
fold=fold,
)
# Train your model on train_ds, evaluate on test_ds
# auroc = evaluate(model, test_ds)
# results[(task, subject, session, fold)] = auroc
Report the mean AUROC across all subject/session pairs for each task, along with the overall mean. Submit results to the Neuroprobe leaderboard following the instructions in the Neuroprobe code repository.
References#
@article{zahorodnii2025neuroprobe,
title={Neuroprobe: Evaluating Intracranial Brain Responses to Naturalistic Stimuli},
author={Zahorodnii, Andrii and Wang, Christopher and Stankovits, Bennett
and Moraitaki, Charikleia and Chau, Geeling and Barbu, Andrei
and Katz, Boris and Fiete, Ila R},
journal={arXiv preprint arXiv:2509.21671},
year={2025}
}