In this paper we propose a novel modification of CLIP guidance for the task of backlit image enhancement.
Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space.
Learned prompts then guide an image enhancement network.
Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance.
First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality.
This accelerates training and potentially enables the use of additional encoders that lack a text counterpart.
Second, we propose a novel approach that does not require any prompt tuning.
Instead, based on the CLIP embeddings of backlit and well-lit images from the training data, we compute a residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images.
This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images.
This approach further reduces training time dramatically, stabilizes training, and produces high-quality enhanced images without artifacts.
Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
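To make the first idea concrete, below is a minimal sketch of tuning pseudo-prompt embeddings directly in the CLIP latent space rather than through the text encoder. It assumes the OpenAI CLIP package and image batches already preprocessed for CLIP; the two learnable vectors, the fixed temperature, and the cross-entropy formulation are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package; any joint image-text encoder would do

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
emb_dim = model.visual.output_dim  # 512 for ViT-B/32

# Two learnable pseudo-prompts (negative, positive) living directly in the
# shared CLIP latent space; no text encoder is involved.
prompts = torch.nn.Parameter(0.02 * torch.randn(2, emb_dim, device=device))
optimizer = torch.optim.Adam([prompts], lr=1e-4)

def prompt_tuning_step(backlit_images, well_lit_images):
    """One optimization step; both inputs are batches already preprocessed for CLIP."""
    with torch.no_grad():
        img_emb = model.encode_image(
            torch.cat([backlit_images, well_lit_images]).to(device)
        ).float()
        img_emb = F.normalize(img_emb, dim=-1)
    # Cosine similarity of every image to the negative (0) and positive (1)
    # prompt, scaled by a fixed temperature (an assumption).
    logits = 100.0 * img_emb @ F.normalize(prompts, dim=-1).t()
    # Backlit images should match the negative prompt, well-lit the positive one.
    labels = torch.cat([
        torch.zeros(len(backlit_images), dtype=torch.long, device=device),
        torch.ones(len(well_lit_images), dtype=torch.long, device=device),
    ])
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()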
In RAVE we exploit the vector arithmetic of the CLIP latent space. Using well-lit and backlit training data, we construct a residual vector that points from backlit images towards well-lit images in the CLIP embedding space. This vector then guides the image enhancement model during training, encouraging it to produce images whose CLIP embeddings lie close to those of well-lit training images.
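The sketch below illustrates this guidance under the same CLIP setup as the previous snippet. The renormalization of the residual, the shifted target, and the cosine-distance loss are assumptions for illustration and may differ from the exact objective used in the paper; the enhanced images are assumed to be resized and normalized for CLIP in a differentiable way before being encoded.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def mean_clip_embedding(images):
    """Mean of L2-normalized CLIP embeddings for a batch of preprocessed images."""
    emb = F.normalize(model.encode_image(images.to(device)).float(), dim=-1)
    return emb.mean(dim=0)

def residual_vector(well_lit_images, backlit_images):
    """Residual pointing from backlit towards well-lit images in CLIP space."""
    r = mean_clip_embedding(well_lit_images) - mean_clip_embedding(backlit_images)
    return F.normalize(r, dim=-1)  # renormalization is an assumption

def rave_guidance_loss(enhanced_images, backlit_images, residual):
    """Cosine-distance guidance: the enhanced image's embedding should land near
    the backlit embedding shifted by the residual, i.e. in the well-lit region."""
    with torch.no_grad():
        backlit_emb = F.normalize(
            model.encode_image(backlit_images.to(device)).float(), dim=-1)
    enhanced_emb = F.normalize(
        model.encode_image(enhanced_images.to(device)).float(), dim=-1)
    target = F.normalize(backlit_emb + residual, dim=-1)
    return (1.0 - (enhanced_emb * target).sum(dim=-1)).mean()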
We report quantitative results for our methods trained in paired and unpaired data setups. In the paired setup, each backlit training image has a corresponding well-lit image in the training data; in the unpaired setup, backlit and well-lit training images may have completely different semantics. RAVE achieves state-of-the-art performance in both settings.
@article{gaintseva2024raveresidualvectorembedding,
  title={RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement},
  author={Tatiana Gaintseva and Martin Benning and Gregory Slabaugh},
  year={2024},
  eprint={2404.01889},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2404.01889},
}