In this paper we propose a novel modification of CLIP guidance for the task of backlit image enhancement.
Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space.
Learned prompts then guide an image enhancement network.
Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance.
First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality.
This accelerates training and potentially enables the use of additional encoders that lack a text counterpart.
Second, we propose a novel approach that does not require any prompt tuning.
Instead, based on the CLIP embeddings of backlit and well-lit images from the training data, we compute a residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images.
This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images.
This approach further reduces training time dramatically, stabilizes training, and produces high-quality enhanced images without artifacts.
Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
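To make the first idea concrete, below is a minimal sketch of tuning pseudo-prompt embeddings directly in the CLIP latent space rather than through the text encoder. It assumes the OpenAI CLIP package and image batches already preprocessed for CLIP; the two learnable vectors, the fixed temperature, and the cross-entropy formulation are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package; any joint image-text encoder would do

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
emb_dim = model.visual.output_dim  # 512 for ViT-B/32

# Two learnable pseudo-prompts (negative, positive) living directly in the
# shared CLIP latent space; no text encoder is involved.
prompts = torch.nn.Parameter(0.02 * torch.randn(2, emb_dim, device=device))
optimizer = torch.optim.Adam([prompts], lr=1e-4)

def prompt_tuning_step(backlit_images, well_lit_images):
    """One optimization step; both inputs are batches already preprocessed for CLIP."""
    with torch.no_grad():
        img_emb = model.encode_image(
            torch.cat([backlit_images, well_lit_images]).to(device)
        ).float()
        img_emb = F.normalize(img_emb, dim=-1)
    # Cosine similarity of every image to the negative (0) and positive (1)
    # prompt, scaled by a fixed temperature (an assumption).
    logits = 100.0 * img_emb @ F.normalize(prompts, dim=-1).t()
    # Backlit images should match the negative prompt, well-lit the positive one.
    labels = torch.cat([
        torch.zeros(len(backlit_images), dtype=torch.long, device=device),
        torch.ones(len(well_lit_images), dtype=torch.long, device=device),
    ])
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()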
In RAVE we exploit the vector arithmetic of the CLIP latent space. Using well-lit and backlit training data, we construct a residual vector that points from backlit images towards well-lit images in the CLIP embedding space. This vector then guides the image enhancement model during training, encouraging it to produce images whose CLIP embeddings lie close to those of well-lit training images.
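The sketch below illustrates this guidance under the same CLIP setup as the previous snippet. The renormalization of the residual, the shifted target, and the cosine-distance loss are assumptions for illustration and may differ from the exact objective used in the paper; the enhanced images are assumed to be resized and normalized for CLIP in a differentiable way before being encoded.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def mean_clip_embedding(images):
    """Mean of L2-normalized CLIP embeddings for a batch of preprocessed images."""
    emb = F.normalize(model.encode_image(images.to(device)).float(), dim=-1)
    return emb.mean(dim=0)

def residual_vector(well_lit_images, backlit_images):
    """Residual pointing from backlit towards well-lit images in CLIP space."""
    r = mean_clip_embedding(well_lit_images) - mean_clip_embedding(backlit_images)
    return F.normalize(r, dim=-1)  # renormalization is an assumption

def rave_guidance_loss(enhanced_images, backlit_images, residual):
    """Cosine-distance guidance: the enhanced image's embedding should land near
    the backlit embedding shifted by the residual, i.e. in the well-lit region."""
    with torch.no_grad():
        backlit_emb = F.normalize(
            model.encode_image(backlit_images.to(device)).float(), dim=-1)
    enhanced_emb = F.normalize(
        model.encode_image(enhanced_images.to(device)).float(), dim=-1)
    target = F.normalize(backlit_emb + residual, dim=-1)
    return (1.0 - (enhanced_emb * target).sum(dim=-1)).mean()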
We report quantitative results for our methods trained in paired and unpaired data setups. In the paired setup, each backlit training image has a corresponding well-lit image in the training data; in the unpaired setup, backlit and well-lit training images may have completely different semantics. RAVE achieves state-of-the-art performance in both settings.
@article{gaintseva2024raveresidualvectorembedding,
  title={RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement},
  author={Tatiana Gaintseva and Martin Benning and Gregory Slabaugh},
  year={2024},
  eprint={2404.01889},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2404.01889},
}