generate_string_kfold_assignment

brainsets.utils.split.generate_string_kfold_assignment(string_id, n_folds=3, val_ratio=0.2, seed=42)

Generate deterministic per-fold train/valid/test assignments for one ID.

The function returns one label per fold index k, chosen by a deterministic two-step rule:

  1. Compute a global bucket from md5(f"{string_id}_{seed}") % n_folds. The fold whose index equals this bucket is labeled "test".

  2. For every other fold, compute a fold-specific hash md5(f"{string_id}_{seed}_{k}") and map it to [0, 1). If that value is below val_ratio, the fold is "valid", otherwise it is "train".

As a result, each string_id is in the test split of exactly one fold. Because the labels depend only on string_id, n_folds, val_ratio, and seed, the output is reproducible across runs, and assignments for different IDs can be computed in parallel without shared state.
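The two-step rule above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the name sketch_kfold_assignment and the helper _md5_int are hypothetical, and the way the digest is converted to an integer and scaled to [0, 1) is an assumption, so the labels it produces may differ from those of generate_string_kfold_assignment.

```python
import hashlib
from typing import List


def _md5_int(text: str) -> int:
    """Interpret the MD5 hex digest of `text` as a non-negative integer."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def sketch_kfold_assignment(
    string_id: str, n_folds: int = 3, val_ratio: float = 0.2, seed: int = 42
) -> List[str]:
    """Hypothetical re-statement of the documented two-step rule."""
    # Step 1: a single global bucket selects the one fold labeled "test".
    test_fold = _md5_int(f"{string_id}_{seed}") % n_folds

    assignments = []
    for k in range(n_folds):
        if k == test_fold:
            assignments.append("test")
            continue
        # Step 2: a fold-specific hash is mapped to [0, 1).
        # Assumption: keep the low 32 bits of the digest and divide by 2**32.
        u = (_md5_int(f"{string_id}_{seed}_{k}") % 2**32) / 2**32
        assignments.append("valid" if u < val_ratio else "train")
    return assignments
```

Whatever scaling is used in step 2, the structure guarantees exactly one "test" entry per ID, and repeated calls with the same arguments return the same list.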

Parameters:
  • string_id (str) – String identifier (e.g., “S001”, “sub-01”, or “sub-01_ses-01”).

  • n_folds (int) – Number of folds for cross-validation. Default is 3.

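  • val_ratio (float) – Expected fraction of non-test (train + valid) folds labeled "valid". A non-test fold is labeled "valid" when its fold-specific hash value falls below this threshold. Default is 0.2.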

  • seed (int) – Random seed for reproducibility. Default is 42.

Returns:

List of fold assignments where index k corresponds to fold k and each value is one of "train", "valid", or "test". Exactly one entry is "test".

Return type:

List[str]

Examples

>>> assignments = generate_string_kfold_assignment("sub-01", n_folds=3)
>>> assignments
['train', 'test', 'train']
>>> generate_string_kfold_assignment("sub-01_ses-01", n_folds=3)
['valid', 'train', 'test']