Datasets

How to add a new dataset

This guide explains how to add a new dataset to the WIBE framework. For more examples, refer to the wibench.datasets module.

Create a your_dataset.py file in the user_plugins directory.

The following is an example of an image-based dataset.

from typing import Generator

from wibench.datasets import BaseDataset
from wibench.typing import ImageObject

class MyDataset(BaseDataset):

    def __init__(self, parameters_of_dataset):
        ...
        # Any initialization the dataset may need.

    def __len__(self) -> int:
        # Length of the dataset, if available (used for the progress bar).
        ...

    def generator(self) -> Generator[ImageObject, None, None]:
        # Yields images from a directory.
        ...
        yield ImageObject(image_id, torch_image)
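
To make the generator contract concrete, here is a self-contained sketch that does not import wibench. BaseDatasetStub and ImageObjectStub are hypothetical stand-ins for BaseDataset and ImageObject, and image decoding is replaced by reading raw bytes so the example runs without torch. The real classes may differ, so treat this as an illustration of the pattern rather than the framework's implementation.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Generator, List, Sequence


@dataclass
class ImageObjectStub:
    # Hypothetical stand-in for wibench.typing.ImageObject.
    image_id: str
    image: bytes  # assumption: raw bytes instead of a torch image


class BaseDatasetStub:
    # Hypothetical stand-in for wibench.datasets.BaseDataset.
    def generator(self):
        raise NotImplementedError


class FolderBytesDataset(BaseDatasetStub):
    def __init__(self, path, img_ext: Sequence[str] = ("png", "jpg")):
        # Collect matching files once, in sorted order, so iteration is reproducible.
        self.files: List[Path] = sorted(
            p for p in Path(path).iterdir()
            if p.suffix.lower().lstrip(".") in img_ext
        )

    def __len__(self) -> int:
        # The length is known here, so a progress bar could use it.
        return len(self.files)

    def generator(self) -> Generator[ImageObjectStub, None, None]:
        # Yield one object per file, using the file stem as the identifier.
        for p in self.files:
            yield ImageObjectStub(p.stem, p.read_bytes())


def demo_ids():
    # Build a throwaway directory with two "images" and one unrelated file.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("b.png", "a.jpg", "notes.txt"):
            (Path(tmp) / name).write_bytes(b"\x00")
        ds = FolderBytesDataset(tmp)
        return len(ds), [obj.image_id for obj in ds.generator()]
```

In a real plugin, the stubs would be replaced by the wibench base class and ImageObject, and the generator would yield decoded images instead of bytes.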

If the number of samples in the dataset can be determined, you may inherit from RangeBaseDataset, which additionally takes a sample_range.

from typing import Generator, Optional, Tuple

from wibench.datasets import RangeBaseDataset
from wibench.typing import ImageObject

class MyDataset(RangeBaseDataset):

    def __init__(self, parameters_of_dataset, sample_range: Optional[Tuple[int, int]] = None):
        ...
        # Pass the requested range and the total number of samples to the base class.
        super().__init__(sample_range, self.__len__())
        ...

    def __len__(self) -> int:
        ...

    def generator(self) -> Generator[ImageObject, None, None]:
        # Yields images from a directory.
        ...
        yield ImageObject(image_id, torch_image)

Implemented datasets

class wibench.datasets.base.ImageFolderDataset(path: Union[Path, str], preload: bool = False, img_ext: List[str] = ['png', 'jpg'], sample_range: Optional[Tuple[int, int]] = None)

Concrete dataset implementation loading images from a directory.

Supports common image formats with optional preloading.

Parameters

path : Union[Path, str]

Directory path containing images

preload : bool

Whether to load all images into memory upfront

img_ext : List[str]

Image file extensions to include (default: ['png', 'jpg'])

sample_range : Optional[Tuple[int, int]]

Optional (start, end) index range to subset the dataset (both bounds inclusive)
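
Because both bounds of sample_range are included, a range of (2, 4) selects three samples. The sketch below illustrates that arithmetic on a plain iterable. The helper names take_inclusive_range and range_length are hypothetical, and zero-based indexing is an assumption, so this is not the framework's own code.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def take_inclusive_range(
    items: Iterable[T],
    sample_range: Optional[Tuple[int, int]] = None,
) -> Iterator[T]:
    # None means "use the full dataset".
    if sample_range is None:
        return iter(items)
    start, end = sample_range
    # islice excludes its stop index, so add 1 to include `end`.
    return islice(items, start, end + 1)


def range_length(total: int, sample_range: Optional[Tuple[int, int]] = None) -> int:
    # Number of samples selected from a dataset of `total` items.
    if sample_range is None:
        return total
    start, end = sample_range
    # Clamp `end` to the last valid index and never return a negative length.
    return max(0, min(end, total - 1) - start + 1)
```

For example, list(take_inclusive_range(range(10), (2, 4))) contains the items at indices 2, 3 and 4, and range_length(10, (2, 4)) is 3.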

class wibench.datasets.base.PromptFolderDataset(path: Union[Path, str], prompt_ext: List[str] = ['txt', 'csv'], sample_range: Optional[Tuple[int, int]] = None, separator: str = '\n')

Concrete dataset implementation loading prompts from a directory.

The directory should contain one or more ".txt" or ".csv" files with prompts; within a file, prompts are separated by separator. All prompts are preloaded.

Parameters

path : Union[Path, str]

Directory path containing prompt files

prompt_ext : List[str]

File extensions to include (default: ['txt', 'csv'])

sample_range : Optional[Tuple[int, int]]

Optional (start, end) index range to subset the dataset (both bounds inclusive). Default: None (full dataset)

separator : str

Separator for prompts in one file (default: '\n')
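
As an illustration of how prompts in such a directory can be split by a separator, here is a minimal sketch. The function load_prompts is hypothetical and is not the PromptFolderDataset implementation; sorting the files and dropping empty pieces are assumptions made for this example.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def load_prompts(
    path,
    prompt_ext: Sequence[str] = ("txt", "csv"),
    separator: str = "\n",
) -> List[str]:
    prompts: List[str] = []
    # Sorted iteration keeps the prompt order stable across runs (assumption).
    for file in sorted(Path(path).iterdir()):
        if file.suffix.lstrip(".") not in prompt_ext:
            continue
        text = file.read_text(encoding="utf-8")
        # Split on the separator and drop empty pieces such as a trailing newline.
        prompts.extend(p.strip() for p in text.split(separator) if p.strip())
    return prompts


def demo_prompts() -> List[str]:
    # Two prompt files and one file with an extension that is skipped.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "a.txt").write_text("a cat\na dog\n", encoding="utf-8")
        (Path(tmp) / "b.csv").write_text("a red car", encoding="utf-8")
        (Path(tmp) / "skip.md").write_text("ignored", encoding="utf-8")
        return load_prompts(tmp)
```

With the default separator, every non-empty line of each matching file becomes one prompt.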

class wibench.datasets.diffusiondb.diffusiondb.DiffusionDB(subset: str = '2m_first_5k', sample_range: Optional[Tuple[int, int]] = None, cache_dir: Optional[str] = None, skip_nsfw: bool = True, return_prompt: bool = False)

Dataset loader for the DiffusionDB large-scale text-to-image dataset.

Provides access to generated images and their prompts from DiffusionDB, with optional NSFW filtering and prompt-only retrieval modes.

Parameters

subset : str

Dataset subset name (e.g., '2m_first_5k')

sample_range : Optional[Tuple[int, int]]

Optional (start, end) index range to subset the dataset

cache_dir : Optional[str]

Directory to cache downloaded dataset files

skip_nsfw : bool

Whether to automatically filter out NSFW images (default: True)

return_prompt : bool

Whether to return prompts instead of images (default: False)

class wibench.datasets.mscoco.mscoco.MSCOCO(split: str = 'val', sample_range: Optional[Tuple[int, int]] = None, cache_dir: Optional[str] = None)

Dataset loader for MS-COCO (Common Objects in Context) images.

Provides access to the COCO 2017 dataset images through HuggingFace Datasets, supporting both validation and training splits with optional caching.

Parameters

split : str

Dataset split to load ('train' or 'val')

sample_range : Optional[Tuple[int, int]]

Optional (start, end) index range to subset the dataset

cache_dir : Optional[str]

Directory to cache downloaded dataset files