Metrics

How to implement a new metric

This guide explains how to implement a new evaluation metric. For more examples, refer to the wibench.metrics module.

Create a your_metric.py file in the user_plugins directory.

A metric should return a string, int, or float value.

Post embed metrics

Metrics of this kind should inherit from the PostEmbedMetric class and implement the __call__ method, which takes 3 arguments:

object data from dataset,
marked object,
watermark_data
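
As an illustration of the three-argument signature above, here is a minimal sketch of a post embed metric. It is not taken from wibench: the PostEmbedMetric class below is a local stand-in so the snippet runs on its own (in a real plugin you would import the class from wibench instead), MaxAbsDiff is a hypothetical name, and images are represented as flat sequences of floats rather than TorchImg tensors.

```python
from typing import Any, Sequence


class PostEmbedMetric:
    """Local stand-in for the wibench base class (assumption, for illustration only)."""


class MaxAbsDiff(PostEmbedMetric):
    """Hypothetical metric: largest absolute per-value change caused by embedding."""

    def __call__(
        self,
        original: Sequence[float],   # object data from the dataset
        marked: Sequence[float],     # marked object
        watermark_data: Any,         # unused here, but part of the signature
    ) -> float:
        if len(original) != len(marked):
            raise ValueError("inputs must have the same number of values")
        # The metric returns a plain float, as required of every metric.
        return max((abs(a - b) for a, b in zip(original, marked)), default=0.0)


metric = MaxAbsDiff()
score = metric([0.0, 0.5, 1.0], [0.0, 0.25, 1.0], None)
```

The third argument is accepted even though this particular metric ignores it, so that the call signature stays the same for every post embed metric.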

Post attack metrics

Metrics of this kind should inherit from the PostEmbedMetric class and implement the __call__ method, which takes 3 arguments:

marked object,
attacked object,
watermark_data

For example, for image-based metrics:

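from typing import Any

from wibench.typing import TorchImg
# PostEmbedMetric must also be imported from wibench;
# its module path is not given in this guide.

class MyMetric(PostEmbedMetric):
    def __call__(
        self,
        img1: TorchImg,       # marked image
        img2: TorchImg,       # attacked image
        watermark_data: Any,
    ):
        ...  # compute metric_res from the two images
        return metric_res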

Post extract metrics

These metrics should inherit from the PostExtractMetric class and implement the __call__ method, which takes 4 arguments:

object data from dataset,
marked object,
watermark_data,
extraction_result from extract method of an algorithm wrapper

For example, for image-based metrics:

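from typing import Any

from wibench.typing import TorchImg
# PostExtractMetric must also be imported from wibench;
# its module path is not given in this guide.

class MyMetric(PostExtractMetric):
    def __call__(
        self,
        img1: TorchImg,          # image from the dataset
        img2: TorchImg,          # marked image
        watermark_data: Any,
        extraction_result: Any,  # output of the wrapper's extract method
    ):
        ...  # compute metric_res from the inputs
        return metric_res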

Implemented metrics

PSNR

class wibench.metrics.base.PSNR

Peak Signal-to-Noise Ratio between original and processed images.

Measures pixel-level difference in decibels. Higher values indicate better quality.

Notes

Range: Typically 20-50 dB for images
Infinite if images are identical
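
The definition above can be sketched with the standard PSNR formula, 10 * log10(MAX^2 / MSE). This is an illustrative stand-alone version, not the wibench implementation: the function name psnr, the default peak value of 1.0 (pixels assumed to lie in [0, 1]), and the use of flat sequences instead of image tensors are all assumptions.

```python
import math
from typing import Sequence


def psnr(
    original: Sequence[float],
    processed: Sequence[float],
    max_value: float = 1.0,  # assumed peak pixel value
) -> float:
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(original) != len(processed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, processed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


identical = psnr([0.2, 0.4], [0.2, 0.4])   # infinite
noisy = psnr([0.0, 0.0], [0.1, 0.1])       # MSE = 0.01, approximately 20 dB
```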

SSIM

class wibench.metrics.base.SSIM

Structural Similarity Index Measure between images.

Perceptual metric assessing structural similarity (range 0-1).

Notes

A value of 1 indicates perfect similarity

BER

class wibench.metrics.base.BER

Bit Error Rate between original and extracted watermarks.

Measures fraction of incorrectly recovered bits.
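
As a minimal sketch of this definition (the function name bit_error_rate and the list-of-ints message representation are assumptions, not the wibench API), the bit error rate is the number of mismatched positions divided by the message length:

```python
from typing import Sequence


def bit_error_rate(original_bits: Sequence[int], extracted_bits: Sequence[int]) -> float:
    """Fraction of positions where the extracted bit differs from the original."""
    if len(original_bits) != len(extracted_bits) or not original_bits:
        raise ValueError("messages must be non-empty and of equal length")
    errors = sum(1 for a, b in zip(original_bits, extracted_bits) if a != b)
    return errors / len(original_bits)


ber = bit_error_rate([1, 0, 1, 1], [1, 1, 1, 0])  # 2 of 4 bits differ
```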

TPRxFPR

class wibench.metrics.base.TPRxFPR(fpr_rate: float)

True Positive Rate at fixed False Positive Rate threshold.

Robustness metric for watermark detection systems.

Parameters

fpr_rate: float
Target false positive rate (e.g., 0.01 for 1% FPR)

Notes

Uses binomial distribution for threshold calculation
Caches thresholds for efficiency
Binary classification metric
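
One way the notes above could fit together is sketched below. It assumes that, for a non-watermarked object, each of the n extracted bits matches the message independently with probability 0.5, so the number of matches follows a binomial distribution. The function names match_threshold and tpr_at_fpr are hypothetical, and the actual wibench implementation may differ.

```python
from functools import lru_cache
from math import comb
from typing import Sequence


@lru_cache(maxsize=None)  # cache thresholds per (n_bits, fpr) pair
def match_threshold(n_bits: int, fpr: float) -> int:
    """Smallest number of matching bits whose false positive rate is <= fpr."""
    tail = 0.0
    threshold = n_bits + 1  # unreachable value: nothing is detected
    for k in range(n_bits, -1, -1):
        # Accumulate P(matches >= k) under Binomial(n_bits, 0.5).
        tail += comb(n_bits, k) / 2 ** n_bits
        if tail > fpr:
            break
        threshold = k
    return threshold


def tpr_at_fpr(match_counts: Sequence[int], n_bits: int, fpr: float) -> float:
    """Fraction of watermarked samples detected at the given target FPR."""
    threshold = match_threshold(n_bits, fpr)
    detected = sum(1 for m in match_counts if m >= threshold)
    return detected / len(match_counts)


# Four watermarked samples with 10-bit messages, target FPR of 1%.
tpr = tpr_at_fpr([10, 9, 10, 7], n_bits=10, fpr=0.01)
```

With 10 bits, P(matches >= 10) = 1/1024 and P(matches >= 9) = 11/1024, so a 1% target requires all 10 bits to match and only two of the four samples count as detected.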

P-value

class wibench.metrics.base.PValue

P-value of the extraction result: the probability of observing a result at least as strong as the observed one when extracting from a non-watermarked object.

Notes

For zero-bit methods, the extraction function is assumed to return the p-value itself.
For multi-bit methods, the p-value is the probability of getting at most the observed number of mismatched bits when the message is random, with uniform i.i.d. bit values.
A lower p-value indicates a more confident “content is watermarked” decision.
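
For the multi-bit case, the description above corresponds to the lower tail of a Binomial(n, 0.5) distribution. The following is a sketch under that reading, with a hypothetical function name; it is not the wibench code.

```python
from math import comb


def multibit_p_value(n_bits: int, n_mismatched: int) -> float:
    """P(at most n_mismatched mismatches) for a random n_bits message."""
    if not 0 <= n_mismatched <= n_bits:
        raise ValueError("n_mismatched must lie between 0 and n_bits")
    # Count the bit patterns with n_mismatched or fewer mismatches.
    favourable = sum(comb(n_bits, k) for k in range(n_mismatched + 1))
    return favourable / 2 ** n_bits


perfect = multibit_p_value(10, 0)   # 1 / 1024
one_off = multibit_p_value(10, 1)   # 11 / 1024
chance = multibit_p_value(10, 10)   # 1.0, no evidence of a watermark
```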

LPIPS

class wibench.metrics.lpips.lpips.LPIPS(net: str = 'alex', device: str = 'cpu')

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric.

The implementation is taken from the GitHub repository.

Initialization Parameters

net: str
Type of network architecture (default ‘alex’)

device: str
Device to run the model on (‘cuda’, ‘cpu’)

Call Parameters

img1: TorchImg
Input image tensor in (C, H, W) format

img2: TorchImg
Input image tensor in (C, H, W) format

watermark_data: Any
Not used, can be anything

Notes

The watermark_data field is required for the pipeline to work correctly

DreamSim

class wibench.metrics.dreamsim.dreamsim.DreamSim(*args, **kwargs)

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data.

The implementation is taken from the GitHub repository.

Initialization Parameters

device: str
Device to run the model on (‘cuda’, ‘cpu’)

Call Parameters

img1: TorchImg
Input image tensor in (C, H, W) format

img2: TorchImg
Input image tensor in (C, H, W) format

watermark_data: Any
Not used, can be anything

Notes

The watermark_data field is required for the pipeline to work correctly

Aesthetic

class wibench.metrics.aesthetic.aesthetic.Aesthetic(*args, **kwargs)

Aesthetic score predictor based on a simple neural net that takes CLIP embeddings as inputs.

The implementation is taken from the GitHub repository. Based on the improved-aesthetic-predictor code base.

Initialization Parameters

device: str
Device to run the model on (‘cuda’, ‘cpu’)

Call Parameters

img1: TorchImg
Input image tensor in (C, H, W) format

img2: TorchImg
Input image tensor in (C, H, W) format

watermark_data: Any
Not used, can be anything

Notes

The watermark_data field is required for the pipeline to work correctly

BLIP

class wibench.metrics.blip.blip.BLIP(device: str = 'cpu')

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation.

The implementation is taken from the GitHub repository. Based on the BLIP code base.

Initialization Parameters

device: str
Device to run the model on (‘cuda’, ‘cpu’)

Call Parameters

prompt: str
Text prompt for comparison

img2: TorchImg
Input image tensor in (C, H, W) format

watermark_data: Any
Not used, can be anything

Notes

The watermark_data field is required for the pipeline to work correctly

CLIPScore

class wibench.metrics.clip.clip.CLIPScore(*args, **kwargs)

CLIPScore: A Reference-free Evaluation Metric for Image Captioning.

The implementation is taken from the GitHub repository. Based on the CLIP code base.

Initialization Parameters

device: str
Device to run the model on (‘cuda’, ‘cpu’)

Call Parameters

prompt: str
Text prompt for comparison

img2: TorchImg
Input image tensor in (C, H, W) format

watermark_data: Any
Not used, can be anything

Notes

The watermark_data field is required for the pipeline to work correctly

CLIP_IQA

class wibench.metrics.clip_iqa.clip_iqa.CLIP_IQA(prompts: Tuple[Union[str, Tuple[str]]] = ('quality',), device: str = 'cpu')

Exploring CLIP for Assessing the Look and Feel of Images.

The implementation is taken from the repository.

Initialization Parameters

prompts: Tuple[Union[str, Tuple[str]]]
Text prompts for assessing the visual quality of an image (default (“quality”,))

device: str
Device to run the model on (‘cuda’, ‘cpu’)

Call Parameters

img1: TorchImg
Input image tensor in (C, H, W) format

img2: TorchImg
Input image tensor in (C, H, W) format

watermark_data: Any
Not used, can be anything

Notes

The watermark_data field is required for the pipeline to work correctly

ImageReward

class wibench.metrics.image_reward.image_reward.ImageReward(device: str = 'cpu')

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation.

The implementation is taken from the GitHub repository.

Initialization Parameters

device: str
Device to run the model on (‘cuda’, ‘cpu’)

Call Parameters

prompt: str
Text prompt for comparison

img2: TorchImg
Input image tensor in (C, H, W) format

watermark_data: Any
Not used, can be anything

Notes

The watermark_data field is required for the pipeline to work correctly

FID

class wibench.metrics.fid.fid.FID(dataset_type: Optional[str] = None, dataset_args: Dict[str, Any] = {'cache_dir': None, 'sample_range': None, 'split': 'val'}, device: str = 'cpu', feature: int = 2048, normalize: bool = True)

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.

The implementation is taken from the repository.

Initialization Parameters

dataset_type: Optional[str]
A dataset of images that will be used as real ones. If not specified, actual images will be added during the pipeline (default None)

dataset_args: Dict[str, Any]
Arguments for the dataset_type dataset (default {“cache_dir”: None, “sample_range”: None, “split”: “val”})

device: str
Device to run the model on (‘cuda’, ‘cpu’)

feature: int
Integer selecting the InceptionV3 feature layer to use. Must be one of 64, 192, 768, 2048 (default 2048)

normalize: bool
Controls normalization of the input image dtype (default True)
