Metrics

How to implement a new metric

This guide explains how to implement a new evaluation metric. For more examples, refer to the wibench.metrics module.

Create a your_metric.py file in the user_plugins directory.

A metric should return a string, int, or float value.

Post embed metrics

Metrics of this kind should inherit from the PostEmbedMetric class and implement the __call__ method. __call__ should take 3 arguments:

  • object data from dataset,

  • marked object,

  • watermark_data
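As an illustration of the three-argument signature above, here is a minimal sketch of a post-embed metric. To keep it runnable without wibench installed, PostEmbedMetric is replaced by a local stand-in class and the compared objects are plain sequences of floats. The MeanAbsDiff metric and both of these simplifications are assumptions made for the example, not part of the library.

```python
from typing import Any, Sequence


class PostEmbedMetric:
    """Local stand-in for wibench's PostEmbedMetric (assumption: the real
    base class only requires subclasses to implement __call__)."""

    def __call__(self, obj: Any, marked_obj: Any, watermark_data: Any):
        raise NotImplementedError


class MeanAbsDiff(PostEmbedMetric):
    """Hypothetical metric: mean absolute difference between two signals."""

    def __call__(
        self,
        obj: Sequence[float],         # object data from the dataset
        marked_obj: Sequence[float],  # marked (watermarked) object
        watermark_data: Any,          # accepted to match the signature, unused
    ) -> float:
        if len(obj) != len(marked_obj) or len(obj) == 0:
            raise ValueError("inputs must be non-empty and of equal length")
        total = sum(abs(a - b) for a, b in zip(obj, marked_obj))
        return total / len(obj)


metric = MeanAbsDiff()
score = metric([0.0, 0.5, 1.0], [0.0, 0.25, 1.0], watermark_data=None)
```

The metric returns a float, which matches the requirement that a metric return a string, int, or float value.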

Post attack metrics

Metrics of this kind should inherit from the PostEmbedMetric class and implement the __call__ method. __call__ should take 3 arguments:

  • marked object,

  • attacked object,

  • watermark_data

For example, for image-based metrics:

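from typing import Any

from wibench.metrics.base import PostEmbedMetric  # assumed import path
from wibench.typing import TorchImg


class MyMetric(PostEmbedMetric):
    def __call__(
        self,
        img1: TorchImg,  # marked image
        img2: TorchImg,  # attacked image
        watermark_data: Any,
    ):
        # Compute the metric value from the two images here.
        ...

        return metric_res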

Post extract metrics

These metrics should inherit from the PostExtractMetric class and implement the __call__ method. __call__ should take 4 arguments:

  • object data from dataset,

  • marked object,

  • watermark_data,

  • extraction_result returned by the extract method of an algorithm wrapper

For example, for image-based metrics:

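from typing import Any

from wibench.metrics.base import PostExtractMetric  # assumed import path
from wibench.typing import TorchImg


class MyMetric(PostExtractMetric):
    def __call__(
        self,
        img1: TorchImg,  # image from the dataset
        img2: TorchImg,  # marked image
        watermark_data: Any,
        extraction_result: Any,  # output of the wrapper's extract method
    ):
        # Compute the metric value from the extraction result here.
        ...

        return metric_res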

Implemented metrics

PSNR

class wibench.metrics.base.PSNR

Peak Signal-to-Noise Ratio between original and processed images.

Measures pixel-level difference in decibels. Higher values indicate better quality.

Notes

  • Range: Typically 20-50 dB for images

  • Infinite if images are identical
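For reference, the standard PSNR definition can be sketched in a few lines of plain Python. This is not the wibench implementation: the function name psnr, the flattened-list input, and the max_value parameter (the largest possible pixel value, 1.0 for images scaled to [0, 1]) are assumptions made for the example.

```python
import math
from typing import Sequence


def psnr(original: Sequence[float], processed: Sequence[float],
         max_value: float = 1.0) -> float:
    """PSNR in dB for two equally sized, flattened images."""
    if len(original) != len(processed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, processed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


same = psnr([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
noisy = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Here same is infinite because the inputs are identical, and noisy is about 20 dB because the mean squared error is 0.01 for a peak value of 1.0.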

SSIM

class wibench.metrics.base.SSIM

Structural Similarity Index Measure between images.

Perceptual metric assessing structural similarity (range 0-1).

Notes

  • A value of 1 indicates perfect similarity
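To make the definition concrete, the sketch below evaluates the SSIM formula over a single window that covers the whole image. Practical implementations, most likely including the one in wibench, average the same expression over local windows, so this simplified version is an assumption for illustration only. The name global_ssim and the flattened-list input are hypothetical; the constants 0.01 and 0.03 are the ones commonly used with the formula.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                max_value: float = 1.0) -> float:
    """SSIM computed over one window spanning the whole flattened image."""
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Stabilising constants, scaled by the dynamic range of the pixel values.
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


identical = global_ssim([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
reduced_contrast = global_ssim([0.1, 0.5, 0.9], [0.2, 0.5, 0.8])
```

Identical inputs give a value of 1 (up to floating-point rounding), while the reduced-contrast copy scores slightly below 1.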

BER

class wibench.metrics.base.BER

Bit Error Rate between original and extracted watermarks.

Measures fraction of incorrectly recovered bits.
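The fraction of incorrectly recovered bits can be computed as in the following sketch. The function name bit_error_rate and the use of plain integer sequences for the watermarks are assumptions made for the example; the wibench class may accept other types.

```python
from typing import Sequence


def bit_error_rate(original: Sequence[int], extracted: Sequence[int]) -> float:
    """Fraction of positions where the extracted bit differs from the original."""
    if len(original) != len(extracted) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = sum(1 for a, b in zip(original, extracted) if a != b)
    return errors / len(original)


# Two of the eight bits differ (positions 2 and 5).
ber = bit_error_rate([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
```

A value of 0 means the watermark was recovered exactly, and a value near 0.5 is what random guessing would produce.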

TPRxFPR

class wibench.metrics.base.TPRxFPR(fpr_rate: float)

True Positive Rate at fixed False Positive Rate threshold.

Robustness metric for watermark detection systems.

Parameters

fpr_rate: float

Target false positive rate (e.g., 0.01 for 1% FPR)

Notes

  • Uses binomial distribution for threshold calculation

  • Caches thresholds for efficiency

  • Binary classification metric
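One way such a threshold could be derived is sketched below, assuming that for a non-watermarked object each of the n extracted bits matches the reference message independently with probability 0.5. The helper names match_threshold and tpr_at_fpr are hypothetical, and the actual wibench code may organise the computation differently; the sketch only illustrates the binomial threshold and the caching mentioned in the notes.

```python
from functools import lru_cache
from math import comb
from typing import Sequence


@lru_cache(maxsize=None)  # thresholds are cached per (n_bits, fpr) pair
def match_threshold(n_bits: int, fpr: float) -> int:
    """Smallest t such that P(Binomial(n_bits, 0.5) >= t) <= fpr."""
    tail = 0.0
    # Walk down from t = n_bits, accumulating the upper-tail probability.
    for t in range(n_bits, -1, -1):
        tail += comb(n_bits, t) / 2 ** n_bits
        if tail > fpr:
            return t + 1  # t itself would already exceed the target FPR
    return 0


def tpr_at_fpr(match_counts: Sequence[int], n_bits: int, fpr: float) -> float:
    """Share of watermarked samples whose matched-bit count passes the threshold."""
    threshold = match_threshold(n_bits, fpr)
    detected = sum(1 for m in match_counts if m >= threshold)
    return detected / len(match_counts)


tpr = tpr_at_fpr([10, 9, 8, 5], n_bits=10, fpr=0.06)
```

With 10-bit messages and a 6% target, at least 8 matching bits are required, so three of the four samples count as detected.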

P-value

class wibench.metrics.base.PValue

P-value of the extraction result: the probability of observing a result at least as extreme as the actual one when extracting from a non-watermarked object.

Notes

  • For zero-bit methods, the extraction function is assumed to return the p-value itself.

  • For multi-bit methods, the p-value is the probability of getting at most the observed number of mismatched bits when the message is random, with i.i.d. uniformly distributed bits.

  • A lower p-value means a more confident “content is watermarked” decision.
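For the multi-bit case, the probability described above is the lower tail of a Binomial(n, 0.5) distribution. A minimal sketch follows; the function name multibit_p_value and its arguments are assumptions made for the example, not the library's interface.

```python
from math import comb


def multibit_p_value(n_bits: int, n_mismatched: int) -> float:
    """P(Binomial(n_bits, 0.5) <= n_mismatched) for a random message."""
    if not 0 <= n_mismatched <= n_bits:
        raise ValueError("n_mismatched must be between 0 and n_bits")
    # Count the bit strings that differ from the reference in at most
    # n_mismatched positions, then divide by the total number of strings.
    favourable = sum(comb(n_bits, k) for k in range(n_mismatched + 1))
    return favourable / 2 ** n_bits


perfect = multibit_p_value(8, 0)    # all 8 bits recovered
one_error = multibit_p_value(4, 1)  # at most 1 of 4 bits wrong
```

Recovering all 8 bits of an 8-bit message by chance has probability 1/256, which is why longer messages allow more confident decisions.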

LPIPS

class wibench.metrics.lpips.lpips.LPIPS(net: str = 'alex', device: str = 'cpu')

Perceptual similarity metric from the paper “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric”.

The implementation is taken from the GitHub repository.

Initialization Parameters

net: str

Type of network architecture (default 'alex')

device: str

Device to run the model on ('cuda', 'cpu')

Call Parameters

img1: TorchImg

Input image tensor in (C, H, W) format

img2: TorchImg

Input image tensor in (C, H, W) format

watermark_data: Any

Not used, can be anything

Notes

  • The watermark_data field is required for the pipeline to work correctly

DreamSim

class wibench.metrics.dreamsim.dreamsim.DreamSim(*args, **kwargs)

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data.

The implementation is taken from the GitHub repository.

Initialization Parameters

device: str

Device to run the model on ('cuda', 'cpu')

Call Parameters

img1: TorchImg

Input image tensor in (C, H, W) format

img2: TorchImg

Input image tensor in (C, H, W) format

watermark_data: Any

Not used, can be anything

Notes

  • The watermark_data field is required for the pipeline to work correctly

Aesthetic

class wibench.metrics.aesthetic.aesthetic.Aesthetic(*args, **kwargs)

Aesthetic score predictor based on a simple neural net that takes CLIP embeddings as inputs.

The implementation is taken from the GitHub repository and is based on the improved-aesthetic-predictor code base.

Initialization Parameters

device: str

Device to run the model on ('cuda', 'cpu')

Call Parameters

img1: TorchImg

Input image tensor in (C, H, W) format

img2: TorchImg

Input image tensor in (C, H, W) format

watermark_data: Any

Not used, can be anything

Notes

  • The watermark_data field is required for the pipeline to work correctly

BLIP

class wibench.metrics.blip.blip.BLIP(device: str = 'cpu')

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation.

The implementation is taken from the GitHub repository and is based on the BLIP code base.

Initialization Parameters

device: str

Device to run the model on ('cuda', 'cpu')

Call Parameters

prompt: str

Text prompt for comparison

img2: TorchImg

Input image tensor in (C, H, W) format

watermark_data: Any

Not used, can be anything

Notes

  • The watermark_data field is required for the pipeline to work correctly

CLIPScore

class wibench.metrics.clip.clip.CLIPScore(*args, **kwargs)

CLIPScore: A Reference-free Evaluation Metric for Image Captioning.

The implementation is taken from the GitHub repository and is based on the CLIP code base.

Initialization Parameters

device: str

Device to run the model on ('cuda', 'cpu')

Call Parameters

prompt: str

Text prompt for comparison

img2: TorchImg

Input image tensor in (C, H, W) format

watermark_data: Any

Not used, can be anything

Notes

  • The watermark_data field is required for the pipeline to work correctly

CLIP_IQA

class wibench.metrics.clip_iqa.clip_iqa.CLIP_IQA(prompts: Tuple[Union[str, Tuple[str]]] = ('quality',), device: str = 'cpu')

CLIP-IQA, from the paper “Exploring CLIP for Assessing the Look and Feel of Images”.

The implementation is taken from the repository.

Initialization Parameters

prompts: Tuple[Union[str, Tuple[str]]]

Tuple of text prompts for assessing the visual quality of an image (default ('quality',))

device: str

Device to run the model on ('cuda', 'cpu')

Call Parameters

img1: TorchImg

Input image tensor in (C, H, W) format

img2: TorchImg

Input image tensor in (C, H, W) format

watermark_data: Any

Not used, can be anything

Notes

  • The watermark_data field is required for the pipeline to work correctly

ImageReward

class wibench.metrics.image_reward.image_reward.ImageReward(device: str = 'cpu')

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation.

The implementation is taken from the GitHub repository.

Initialization Parameters

device: str

Device to run the model on ('cuda', 'cpu')

Call Parameters

prompt: str

Text prompt for comparison

img2: TorchImg

Input image tensor in (C, H, W) format

watermark_data: Any

Not used, can be anything

Notes

  • The watermark_data field is required for the pipeline to work correctly

FID

class wibench.metrics.fid.fid.FID(dataset_type: Optional[str] = None, dataset_args: Dict[str, Any] = {'cache_dir': None, 'sample_range': None, 'split': 'val'}, device: str = 'cpu', feature: int = 2048, normalize: bool = True)

Fréchet Inception Distance, introduced in the paper “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium”.

The implementation is taken from the repository.

Initialization Parameters

dataset_type: Optional[str]

Dataset of images to use as the real ones. If not specified, actual images are added during the pipeline (default None)

dataset_args: Dict[str, Any]

Arguments for the dataset_type dataset (default {'cache_dir': None, 'sample_range': None, 'split': 'val'})

device: str

Device to run the model on ('cuda', 'cpu')

feature: int

Integer selecting the InceptionV3 feature layer to use. One of 64, 192, 768, 2048 (default 2048)

normalize: bool

Argument for controlling the input image dtype normalization (default True)