Datasets

How to add a new dataset

This guide explains how to add a new dataset to the WIBE framework. For more examples, refer to the wibench.datasets module.

Create a your_dataset.py file in the user_plugins directory.

The following example shows an image-based dataset.

from typing import Generator

from wibench.datasets import BaseDataset
from wibench.typing import ImageObject

class MyDataset(BaseDataset):

    def __init__(self, parameters_of_dataset):
        # Any initialization the dataset may need.
        ...

    def __len__(self) -> int:
        # Length of the dataset, if available; used for the progress bar.
        ...

    def generator(self) -> Generator[ImageObject, None, None]:
        # Yields images from a directory.
        ...
        yield ImageObject(image_id, torch_image)

If the number of samples in the dataset can be determined, you can inherit from RangeBaseDataset instead.

from typing import Generator, Optional, Tuple

from wibench.datasets import RangeBaseDataset
from wibench.typing import ImageObject

class MyDataset(RangeBaseDataset):

    def __init__(self, parameters_of_dataset, sample_range: Optional[Tuple[int, int]] = None):
        ...
        super().__init__(sample_range, self.__len__())
        ...

    def __len__(self) -> int:
        ...

    def generator(self) -> Generator[ImageObject, None, None]:
        # Yields images from a directory.
        ...
        yield ImageObject(image_id, torch_image)
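
To make the __len__/generator contract above concrete, here is a minimal self-contained sketch. It does not import wibench: ImageObject below is a stand-in dataclass whose field names (image_id, image) are assumptions, and InMemoryDataset is a hypothetical class that only mimics the two methods shown in the templates.

```python
from dataclasses import dataclass
from typing import Any, Generator, List


@dataclass
class ImageObject:
    """Stand-in for wibench.typing.ImageObject; field names are assumed."""
    image_id: str
    image: Any


class InMemoryDataset:
    """Hypothetical dataset exposing the same two methods as the templates."""

    def __init__(self, images: List[Any]):
        # Keep the samples in memory so the length is known up front.
        self.images = images

    def __len__(self) -> int:
        # A known length lets a caller size a progress bar.
        return len(self.images)

    def generator(self) -> Generator[ImageObject, None, None]:
        # Yield one object per sample, using the index as the identifier.
        for index, image in enumerate(self.images):
            yield ImageObject(str(index), image)


dataset = InMemoryDataset(["img_a", "img_b", "img_c"])
ids = [obj.image_id for obj in dataset.generator()]
```

In a real plugin the class would inherit from BaseDataset or RangeBaseDataset and yield torch images, as in the templates above.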

Implemented datasets

class wibench.datasets.base.ImageFolderDataset(path: Union[Path, str], preload: bool = False, img_ext: List[str] = ['png', 'jpg'], sample_range: Optional[Tuple[int, int]] = None)

Concrete dataset implementation loading images from a directory.

Supports common image formats with optional preloading.

Parameters

path (Union[Path, str]): Directory path containing images
preload (bool): Whether to load all images into memory upfront
img_ext (List[str]): Image file extensions to include (default: ['png', 'jpg'])
sample_range (Optional[Tuple[int, int]]): Optional (start, end) index range to subset the dataset (including both borders)
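
Because sample_range includes both borders, it differs from a standard Python slice, where the end index is excluded. The sketch below illustrates that semantics only; take_inclusive_range is a hypothetical helper, not part of wibench, and the library may implement the subsetting differently.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def take_inclusive_range(
    items: Iterable[T],
    sample_range: Optional[Tuple[int, int]] = None,
) -> Iterator[T]:
    """Yield items whose index lies in [start, end], both borders included."""
    if sample_range is None:
        # No range given: iterate over the full dataset.
        return iter(items)
    start, end = sample_range
    # islice excludes its stop index, so add 1 to include the end border.
    return islice(items, start, end + 1)


files = ["0.png", "1.png", "2.png", "3.png", "4.png"]
subset = list(take_inclusive_range(files, (1, 3)))
```

Here a range of (1, 3) selects three samples (indices 1, 2, and 3), whereas the slice files[1:3] would select only two.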

class wibench.datasets.base.PromptFolderDataset(path: Union[Path, str], prompt_ext: List[str] = ['txt', 'csv'], sample_range: Optional[Tuple[int, int]] = None, separator: str = '\n')

Concrete dataset implementation loading prompts from a directory.

The directory should contain one or more ".txt" or ".csv" files with prompts. Within a single file, prompts are separated by separator. All prompts are preloaded.

Parameters

path (Union[Path, str]): Directory path containing prompt files
prompt_ext (List[str]): File extensions to include (default: ['txt', 'csv'])
sample_range (Optional[Tuple[int, int]]): Optional (start, end) index range to subset the dataset (including both borders). Default: None (full dataset)
separator (str): Separator for prompts in one file (default: '\n')
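
The role of the extension filter and the separator can be sketched without wibench. The load_prompts function below is a hypothetical illustration; sorting the files by name and skipping empty prompts are assumptions, not documented behavior of PromptFolderDataset.

```python
import tempfile
from pathlib import Path
from typing import List


def load_prompts(path: Path, prompt_ext: List[str], separator: str = "\n") -> List[str]:
    """Collect prompts from all files in `path` with a matching extension."""
    prompts: List[str] = []
    # Assumption: files are visited in sorted name order.
    for file in sorted(path.iterdir()):
        if file.suffix.lstrip(".") not in prompt_ext:
            continue  # skip files with other extensions
        text = file.read_text(encoding="utf-8")
        # Assumption: blank entries (e.g. after a trailing newline) are dropped.
        prompts.extend(p for p in text.split(separator) if p.strip())
    return prompts


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("a cat\na dog\n", encoding="utf-8")
    (root / "b.csv").write_text("a red car", encoding="utf-8")
    (root / "notes.md").write_text("ignored", encoding="utf-8")
    prompts = load_prompts(root, ["txt", "csv"])
```

With the default separator, each line of a file becomes one prompt; a different separator would allow multi-line prompts within a file.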

class wibench.datasets.diffusiondb.diffusiondb.DiffusionDB(subset: str = '2m_first_5k', sample_range: Optional[Tuple[int, int]] = None, cache_dir: Optional[str] = None, skip_nsfw: bool = True, return_prompt: bool = False)

Dataset loader for the DiffusionDB large-scale text-to-image dataset.

Provides access to generated images and their prompts from DiffusionDB, with optional NSFW filtering and prompt-only retrieval modes.

Parameters

subset (str): Dataset subset name (e.g., '2m_first_5k')
sample_range (Optional[Tuple[int, int]]): Optional (start, end) index range to subset the dataset
cache_dir (Optional[str]): Directory to cache downloaded dataset files
skip_nsfw (bool): Whether to automatically filter out NSFW images (default: True)
return_prompt (bool): Whether to return prompts instead of images (default: False)

class wibench.datasets.mscoco.mscoco.MSCOCO(split: str = 'val', sample_range: Optional[Tuple[int, int]] = None, cache_dir: Optional[str] = None)

Dataset loader for MS-COCO (Common Objects in Context) images.

Provides access to the COCO 2017 dataset images through HuggingFace Datasets, supporting both validation and training splits with optional caching.

Parameters

split (str): Dataset split to load ('train' or 'val')
sample_range (Optional[Tuple[int, int]]): Optional (start, end) index range to subset the dataset
cache_dir (Optional[str]): Directory to cache downloaded dataset files