Datasets
How to add a new dataset
This guide explains how to add a new dataset to the WIBE framework. For more examples, refer to the wibench.datasets module.
Create a your_dataset.py file in the user_plugins directory.
The following is an example of an image-based dataset.
from typing import Generator

from wibench.datasets import BaseDataset
from wibench.typing import ImageObject


class MyDataset(BaseDataset):
    def __init__(self, parameters_of_dataset):
        # Any initialization the dataset may need.
        ...

    def __len__(self) -> int:
        # Length of the dataset, if available; used for the progress bar.
        ...

    def generator(self) -> Generator[ImageObject, None, None]:
        # Yields images from a directory.
        ...
        yield ImageObject(image_id, torch_image)
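The generator pattern above can be tried without the framework installed. The following is a minimal sketch under stated assumptions: ImageObject and BaseDataset here are local stand-ins (their fields and methods are assumed, not taken from wibench), and InMemoryDataset is a hypothetical example class.

```python
from dataclasses import dataclass
from typing import Any, Generator, List


@dataclass
class ImageObject:
    """Stand-in for wibench.typing.ImageObject (field names are assumptions)."""
    id: str
    image: Any


class BaseDataset:
    """Stand-in for wibench.datasets.BaseDataset; the real class may differ."""

    def generator(self) -> Generator[ImageObject, None, None]:
        raise NotImplementedError


class InMemoryDataset(BaseDataset):
    """Hypothetical dataset that yields images already held in memory."""

    def __init__(self, images: List[Any]):
        self.images = images

    def __len__(self) -> int:
        # Known up front, so a progress bar could use it.
        return len(self.images)

    def generator(self) -> Generator[ImageObject, None, None]:
        # Yield one ImageObject per stored image, using the index as its id.
        for index, image in enumerate(self.images):
            yield ImageObject(str(index), image)


dataset = InMemoryDataset([[0, 0, 0], [255, 255, 255]])
ids = [sample.id for sample in dataset.generator()]
```

In a real plugin the two stand-in classes would be replaced by the imports from wibench shown above, and the image would be a torch tensor rather than a plain list.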
If the number of samples in the dataset can be determined, you can inherit from RangeBaseDataset, which additionally accepts a sample_range argument.
from typing import Generator, Optional, Tuple

from wibench.datasets import RangeBaseDataset
from wibench.typing import ImageObject


class MyDataset(RangeBaseDataset):
    def __init__(self, parameters_of_dataset, sample_range: Optional[Tuple[int, int]] = None):
        # Set up whatever __len__ needs before calling super().__init__().
        ...
        super().__init__(sample_range, self.__len__())
        ...

    def __len__(self) -> int:
        ...

    def generator(self) -> Generator[ImageObject, None, None]:
        # Yields images from a directory.
        ...
        yield ImageObject(image_id, torch_image)
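The effect of a (start, end) sample range can be illustrated independently of the framework. The sketch below assumes zero-based indices and that both ends are included, as the ImageFolderDataset reference states; apply_sample_range and range_length are hypothetical helpers, not part of wibench, and RangeBaseDataset may implement this differently.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def apply_sample_range(
    samples: Iterable[T], sample_range: Optional[Tuple[int, int]] = None
) -> Iterator[T]:
    """Yield only the samples whose index lies in [start, end], inclusive."""
    if sample_range is None:
        return iter(samples)
    start, end = sample_range
    # islice excludes its stop index, so add 1 to include the end sample.
    return islice(samples, start, end + 1)


def range_length(total: int, sample_range: Optional[Tuple[int, int]] = None) -> int:
    """Number of samples left after applying an inclusive range to `total` items."""
    if sample_range is None:
        return total
    start, end = sample_range
    end = min(end, total - 1)  # clamp a range that runs past the dataset
    return max(0, end - start + 1)


selected = list(apply_sample_range(range(10), (2, 5)))
```

With an inclusive range, (2, 5) selects four samples, not three, which is worth keeping in mind when splitting a dataset into consecutive chunks.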
Implemented datasets
- class wibench.datasets.base.ImageFolderDataset(path: Union[Path, str], preload: bool = False, img_ext: List[str] = ['png', 'jpg'], sample_range: Optional[Tuple[int, int]] = None)
Concrete dataset implementation loading images from a directory.
Supports common image formats with optional preloading.
Parameters
- path : Union[Path, str]
Directory path containing images
- preload : bool
Whether to load all images into memory upfront (default: False)
- img_ext : List[str]
Image file extensions to include (default: ['png', 'jpg'])
- sample_range : Optional[Tuple[int, int]]
Optional (start, end) index range to subset the dataset (both endpoints inclusive)
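To make the img_ext parameter concrete, the sketch below shows one way a directory could be filtered by extension. It is not the actual ImageFolderDataset implementation: list_images is a hypothetical helper, and the case-insensitive matching and sorted ordering are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence, Union


def list_images(
    path: Union[Path, str], img_ext: Sequence[str] = ("png", "jpg")
) -> List[Path]:
    """Return files in `path` whose extension is in `img_ext`, sorted by name."""
    # Normalize "png" and ".png" to the same ".png" suffix form.
    allowed = {"." + ext.lower().lstrip(".") for ext in img_ext}
    return sorted(
        p for p in Path(path).iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )


# Demonstrate on a temporary directory with two images and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.png", "a.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_images(tmp)]
```

A stable ordering matters when sample_range is used, since the range refers to positions in the file listing.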
- class wibench.datasets.base.PromptFolderDataset(path: Union[Path, str], prompt_ext: List[str] = ['txt', 'csv'], sample_range: Optional[Tuple[int, int]] = None, separator: str = '\n')
Concrete dataset implementation loading prompts from a directory.
The directory should contain one or more ".txt" or ".csv" files with prompts; within a file, prompts are separated by separator. All prompts are preloaded.
Parameters
- path : Union[Path, str]
Directory path containing prompt files
- prompt_ext : List[str]
File extensions to include (default: ['txt', 'csv'])
- sample_range : Optional[Tuple[int, int]]
Optional (start, end) index range to subset the dataset (both endpoints inclusive). Default: None (full dataset)
- separator : str
Separator between prompts within one file (default: '\n')
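The prompt file layout described above can be sketched as follows. This is an illustration only, not the PromptFolderDataset implementation: load_prompts is a hypothetical helper, and stripping whitespace and skipping empty entries are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence, Union


def load_prompts(
    path: Union[Path, str],
    prompt_ext: Sequence[str] = ("txt", "csv"),
    separator: str = "\n",
) -> List[str]:
    """Read every matching file in `path` and split its text into prompts."""
    allowed = {"." + ext.lstrip(".") for ext in prompt_ext}
    prompts: List[str] = []
    for file in sorted(Path(path).iterdir()):
        if file.is_file() and file.suffix in allowed:
            text = file.read_text(encoding="utf-8")
            # Drop empty pieces, e.g. the one after a trailing separator.
            prompts.extend(p.strip() for p in text.split(separator) if p.strip())
    return prompts


# Demonstrate on a temporary directory with two prompt files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("a red fox\na blue bird\n", encoding="utf-8")
    (Path(tmp) / "b.csv").write_text("a green frog", encoding="utf-8")
    (Path(tmp) / "skip.md").write_text("ignored", encoding="utf-8")
    prompts = load_prompts(tmp)
```

Because all prompts are collected into one list before use, a sample_range would index into the combined list across files rather than into any single file.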
- class wibench.datasets.diffusiondb.diffusiondb.DiffusionDB(subset: str = '2m_first_5k', sample_range: Optional[Tuple[int, int]] = None, cache_dir: Optional[str] = None, skip_nsfw: bool = True, return_prompt: bool = False)
Dataset loader for the DiffusionDB large-scale text-to-image dataset.
Provides access to generated images and their prompts from DiffusionDB, with optional NSFW filtering and prompt-only retrieval modes.
Parameters
- subset : str
Dataset subset name (e.g., '2m_first_5k')
- sample_range : Optional[Tuple[int, int]]
Optional (start, end) index range to subset the dataset
- cache_dir : Optional[str]
Directory to cache downloaded dataset files
- skip_nsfw : bool
Whether to automatically filter out NSFW images (default: True)
- return_prompt : bool
Whether to return prompts instead of images (default: False)
- class wibench.datasets.mscoco.mscoco.MSCOCO(split: str = 'val', sample_range: Optional[Tuple[int, int]] = None, cache_dir: Optional[str] = None)
Dataset loader for MS-COCO (Common Objects in Context) images.
Provides access to the COCO 2017 dataset images through HuggingFace Datasets, supporting both validation and training splits with optional caching.
Parameters
- split : str
Dataset split to load ('train' or 'val')
- sample_range : Optional[Tuple[int, int]]
Optional (start, end) index range to subset the dataset
- cache_dir : Optional[str]
Directory to cache downloaded dataset files