NTIRE 2024 Restore Any Image Model (RAIM) in the Wild Challenge

Jie Liang   Radu Timofte  Qiaosi Yi  Shuaizheng Liu  Lingchen Sun  Rongyuan Wu  Xindong Zhang  Hui Zeng  Lei Zhang  Yibin Huang  Shuai Liu  Yongqiang Li  Chaoyu Feng  Xiaotao Wang  Lei Lei  Yuxiang Chen  Xiangyu Chen  Qiubo Chen  Fengyu Sun  Mengying Cui  Jiaxu Chen  Zhenyu Hu  Jingyun Liu  Wenzhuo Ma  Ce Wang  Hanyou Zheng  Wanjie Sun  Zhenzhong Chen  Ziwei Luo  Fredrik K. Gustafsson  Zheng Zhao  Jens Sjölund  Thomas B. Schön  Xiong Dun  Pengzhou Ji  Yujie Xing  Xuquan Wang  Zhanshan Wang  Xinbin Cheng  Jun Xiao    Chenhang He  Xiuyuan Wang  Zhi-Song Liu  Zimeng Miao  Zhicun Yin  Ming Liu  Wangmeng Zuo  Shuai Li

Abstract

In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model (RAIM) in the Wild. The RAIM challenge constructed a benchmark for image restoration in the wild, including real-world images with and without reference ground truth, covering various scenarios from real applications. Participants were required to restore real-captured images suffering from complex and unknown degradations, where both perceptual quality and fidelity are desired in the restoration results. The challenge consisted of two tasks. Task one employed real paired data, where quantitative evaluation is available. Task two used unpaired images, and a comprehensive user study was conducted. The challenge attracted more than 200 registrations, 39 of which submitted results with more than 400 submissions in total. The top-ranked methods improved the state-of-the-art restoration performance and received unanimous recognition from all 18 judges. The proposed datasets are available at https://drive.google.com/file/d/1DqbxUoiUqkAIkExu3jZAqoElr_nu1IXb/view?usp=sharing and the homepage of this challenge is at https://codalab.lisn.upsaclay.fr/competitions/17632.

1 Introduction

Image restoration, which aims at recovering high-quality images from their low-quality counterparts, is one of the most popular low-level vision tasks in the research community. However, there has long been a large gap between academic research and industrial application. For example, the image signal processing (ISP) systems of digital cameras always face mixed and complex degradations, yet most methods in academic research are designed and evaluated on simulated and limited degradations. How to design and train a model that generalizes to practical applications is a challenging yet highly valuable problem.

Deep learning techniques have significantly advanced the performance of image restoration. Recently, generative adversarial networks have shown good performance in approximating the distributions of real photos in image restoration tasks, while large-scale pre-trained generative diffusion models have provided powerful priors to further improve the quality of image restoration outputs.

This challenge aims to provide a platform for industrial and academic participants to test and evaluate their algorithms and models on real-world imaging scenarios, bridging the gap between academic research and practical photography. The objectives of this RAIM challenge are:

  • Construct a benchmark for image restoration in the wild, including real-world images with/without reference ground-truth in various scenarios and objective/subjective evaluation methods;

  • Promote the research and development of RAIMs with strong generalization performance to images in the wild.

Footnote: Jie Liang, Radu Timofte, Qiaosi Yi, Shuaizheng Liu, Lingchen Sun, Rongyuan Wu, Xindong Zhang, Hui Zeng and Lei Zhang are the organizers of the NTIRE 2024 challenge, and the other authors are the participants.
Footnote: The Appendix lists the authors' teams and affiliations.
Footnote: NTIRE 2024 website: https://cvlai.net/ntire/2024/

This challenge is one of the NTIRE 2024 Workshop associated challenges on: dense and non-homogeneous dehazing [3], night photography rendering [5], blind compressed image enhancement [47], shadow removal [35], efficient super-resolution [30], image super-resolution (×4) [9], light field image super-resolution [42], stereo image super-resolution [38], HR depth from images of specular and transparent surfaces [50], bracketing image restoration and enhancement [58], portrait quality assessment [7], quality assessment for AI-generated content [20], restore any image model (RAIM) in the wild [18], RAW image super-resolution [11], short-form UGC video quality assessment [16], low light enhancement [21], and RAW burst alignment and ISP challenge.

2 NTIRE 2024 RAIM Challenge

2.1 Training Data

In this challenge, participants can train their models using any data they can collect and any pre-trained models they can access.

2.2 Validation and Test Data

To facilitate the design and development of RAIM by participants, we provide two types of validation and test data: paired data with reference ground truth (R-GT), and unpaired data. All data is available now at https://drive.google.com/file/d/1DqbxUoiUqkAIkExu3jZAqoElr_nu1IXb/view?usp=sharing.

2.2.1 Paired Data with R-GT

To facilitate the model validation, we first provide paired data covering the following scenarios, where both the input low-quality image and the high-quality R-GT can be collected. Examples can be found in Figure 1.

[Figure 1: examples of the paired data with R-GT.]

Image denoising. Due to compromises in size and cost, the photosensitivity of imaging sensors, especially those on mobile phones, is limited. Meanwhile, the illumination of shooting scenes can be poor, especially in low-light imaging. Image denoising is thus a fundamental requirement for image restoration.

Image super-resolution. The focal length of mobile phone cameras is limited, making it hard to meet the needs of continuous and ultra-long magnification zoom. Therefore, mobile cameras are equipped with digital zoom algorithms, namely super-resolution algorithms.

Out-of-focus restoration. The autofocus (AF) algorithm cannot guarantee 100% focus accuracy. In fleeting moments of excitement, such as blowing out birthday candles or fireworks, image restoration algorithms are expected to remedy slightly defocused shots.

Motion deblur. Due to limitations in aperture size and sensor capability, mobile phone cameras face a trade-off between shutter time and motion blur. A longer shutter time may enhance noise reduction, but it is prone to motion blur when there is foreground object motion or handheld shake. An effective motion deblurring algorithm is therefore in demand.

The combinations of the above. When capturing a real-world photo in the wild, the above issues are usually triggered simultaneously by several factors, such as motion blur in high-magnification super-resolution, out-of-focus in low-light environments, etc. When multiple problems appear in one image, a strong model is needed to solve them jointly.

In this challenge, we use these data to calculate full-reference metrics to partially measure the effectiveness of the algorithms and screen the top performers in the early stage.

2.2.2 Data without R-GT

In many practical scenarios, the R-GT is very difficult to collect, and the image restoration performance is hard, if not impossible, to measure by full-reference metrics. In this challenge, we also provide data exhibiting the following commonly encountered issues in practice. Examples can be found in Figure 2.

[Figure 2: examples of the data without R-GT.]

Smoothed details and textures. Limited by hardware and on-chip computing power, images captured by mobile phone cameras often face a trade-off between noise/artifacts reduction and details/texture richness, impacting the visual quality. However, due to the lack of effective quantitative measures, evaluation can only be done through subjective observation.

Text stroke adhesion in super-resolution. In the telephoto mode (e.g., equivalent focal length larger than 230mm), shooting small text from a distance is an important yet highly challenging task. Text stroke adhesion, or super-resolution errors (i.e., presenting wrong characters), will greatly deteriorate the user experience.

Highlight edge and color artifacts. The optical systems of mobile phones are limited and prone to purple fringing, green fringing, halos, and fake textures in highlight areas. This problem occurs frequently in reflective scenes, backlit scenes, night scenes, etc., and greatly affects the user's perception.

Low-frequency color noise/blocks/bandings. The low SNR of the input in mobile phone cameras demands a heavy denoising algorithm to output a clean image. However, due to factors such as computing power and storage, the bit-width of the ISP system is limited. When transitioning from the linear domain to the nonlinear domain, visual color noise, blocks and bandings often appear.

High-frequency aliasing and Moiré patterns. Due to the limited resolution of imaging sensors, Moiré patterns can appear at specific distances and frequencies. Although users have certain expectations (or understanding) about the appearance of Moiré patterns, they still hope to reduce their probability and severity without affecting the clarity of the image.

2.3 Evaluation Measures

We evaluate the effectiveness of the models with both quantitative measures and subjective evaluation.

2.3.1 Quantitative Measure

Following prior art, we employ the PSNR, SSIM, LPIPS, DISTS and NIQE measures to evaluate the models quantitatively on the data with R-GT. The script of this measure is available at https://drive.google.com/file/d/1Q1CvlbGo-WOgqya5GulS5eYIi2Rgcj5l/view?usp=sharing. The evaluation score is computed as follows:

$\mathrm{SCORE} = 20\times\frac{\mathrm{PSNR}}{50} + 15\times\frac{\mathrm{SSIM}-0.5}{0.5} + 20\times\frac{1-\mathrm{LPIPS}}{0.4} + 40\times\frac{1-\mathrm{DISTS}}{0.3} + 30\times\frac{1-\mathrm{NIQE}}{10}.$
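For concreteness, the score can be computed from pre-averaged metric values as in the following sketch (the function name and argument order are ours, not taken from the official script):

# A minimal sketch of the phase 2 scoring formula above; the inputs are
# assumed to be metric values already averaged over the evaluation set.
def raim_score(psnr: float, ssim: float, lpips: float, dists: float, niqe: float) -> float:
    return (20 * psnr / 50
            + 15 * (ssim - 0.5) / 0.5
            + 20 * (1 - lpips) / 0.4
            + 40 * (1 - dists) / 0.3
            + 30 * (1 - niqe) / 10)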

2.3.2 Subjective Evaluation

For the test data without R-GT, we judged the perceptual quality of the restored results by visual inspection. Specifically, we invited 18 experienced practitioners and conducted a comprehensive user study. The following features were considered in the evaluation:

Textures and details. The restored image should have fine and natural textures and details.

Noise. Noise, especially color noise, should be eliminated. Some luminance noise can be kept to avoid over-smoothness in flat areas.

Artifacts. Various artifacts, such as worm-like artifacts, color blocks, bandings, over-sharpening, and so on, should be reduced as much as possible.

Fidelity. The restored image should be loyal to the given input. More details were discussed with all participants during the competition by referencing specific images and model outputs.

2.4 Phases

2.4.1 Phase 1: Model Design and Tuning

In this phase, participants can analyze the given data and tune their models accordingly. We provided:

  • 100 pairs of paired data (i.e., input with R-GT), which can be used to tune the models based on the quantitative measures.

  • 100 images without R-GT, which can be used to tune the model according to visual perception.

2.4.2 Phase 2: Online Feedback

In this phase, participants can upload their results and get official feedback. We provide:

  • the input low-quality images of another 100 pairs of paired data.

Only the low-quality input images are provided; participants can upload their restoration results to the server and obtain the quantitative scores online. Participants can also upload their results on the images without R-GT provided in Phase 1 to seek feedback. The organizers provide feedback to a couple of teams that achieve the highest quantitative scores on the images with R-GT.

2.4.3 Phase 3: Final Evaluation

In this phase, we provide:

  • another 50 images without R-GT for subjective evaluation.

In this phase, we select the top ten teams according to the quantitative scores on the 100 images with R-GT in Phase 2, and then arrange a comprehensive user study on their results for the above 50 images without R-GT. The final ranks of the ten teams are decided based on both the quantitative scores and the subjective user study, with weights of 40% and 60%, respectively.

2.5 Awards

The following awards of this challenge are provided:

  • One first-class award (i.e., the champion) with a cash prize of US$1000;

  • Two second-class awards with cash prizes of US$500 each;

  • Three third-class awards with cash prizes of US$200 each.

2.6 Important Dates

  • 2024.02.07: Released data of phase 1. Phase 1 began;

  • 2024.02.25: Released data of phase 2. Phase 2 began;

  • 2024.03.17: Released data of phase 3. Phase 3 began;

  • 2024.03.22: Phase 3 results submission deadline;

  • 2024.03.27: Final rank announced.

3 Challenge Results

In total, the challenge received more than 200 registrations, 39 of which submitted results in phase 2 with more than 400 submissions. In phase 3, we invited the top 12 teams of phase 2 and received 9 valid submissions. Brief descriptions of the methods from the participating teams are provided in Section 4, while the team information is provided in Section 6.

3.1 Phase 2: quantitative comparison on paired data with R-GT

In phase 2, we received submissions from more than 30 teams; the quantitative results of the top-ranked teams are shown in Table 1. The evaluation measure is described in Section 2.3.1.

Table 1. Results of the top-ranked teams.

Team           Score in Phase 2   Score in Phase 3   Final Score   Rank
MiAlgo         79.13              57                 91.65         1
Xhs-IAG        81.96              47                 82.07         2
So Elegant     79.69              46                 80.09         3
IIP_IR         80.03              14                 45.94         4
DACLIP-IR      78.65              9                  40.03         5
TongJi-IPOE    72.99              11                 39.91         6
ImagePhoneix   78.93              4                  34.79         7
HIT-IIL        69.80              1                  27.92         8

3.2 Phase 3: qualitative comparison on unpaired data

In phase 3, we invited 18 students and engineers working on low-level vision, each of whom was asked to select the top three results for each of the 50 samples. They followed the unified principles described in Section 2.3.2 as well as the feedback given to individual participants. The team information was hidden and the results were randomly shuffled to ensure a fair comparison. By checking the results of each scorer, we found that their opinions were largely consistent, which validates the user study. The final score $S_{final}$ is calculated by

$S_{final} = 0.4 \times S_2 + 0.6 \times S_3^n$,   (1)

where $S_2$ indicates the score in phase 2 and $S_3^n$ denotes the normalized score in phase 3.

To calculate $S_3^n$, we first compute the phase 3 score $S_3$: a team is rewarded with 3 points when its result is selected as top 1, 2 points for top 2, and 1 point for top 3, and the points are averaged over the 18 judges. Then, $S_3^n$ is computed by

${S_3^n}_{team_i} = \frac{S_{3_{team_i}} - \min(S_3)}{\max(S_3) - \min(S_3)}.$   (2)
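The ranking computation of Eqs. (1) and (2) can be sketched as follows. Note that, to reproduce the final scores in Table 1, the normalized phase 3 score appears to be expressed on a 0-100 scale before the 40/60 weighting; this scaling is our reading of the reported numbers rather than an explicitly stated rule.

import numpy as np

def final_scores(s2: np.ndarray, s3: np.ndarray) -> np.ndarray:
    # Eq. (2), with S3n scaled to 0-100 so that Eq. (1) reproduces Table 1.
    s3n = (s3 - s3.min()) / (s3.max() - s3.min()) * 100.0
    # Eq. (1): 40% quantitative score, 60% normalized user-study score.
    return 0.4 * s2 + 0.6 * s3n

# Phase 2 and phase 3 scores of the eight teams in Table 1.
s2 = np.array([79.13, 81.96, 79.69, 80.03, 78.65, 72.99, 78.93, 69.80])
s3 = np.array([57.0, 47.0, 46.0, 14.0, 9.0, 11.0, 4.0, 1.0])
print(np.round(final_scores(s2, s3), 2))
# [91.65 82.07 80.09 45.94 40.03 39.91 34.79 27.92]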

We then show some example visual comparisons in Figures 3, 4, 5 and 6. All visual results of phase 3 are available at https://drive.google.com/file/d/1_vxF2s-WRm59F8Vn1nquE7q4R2zHZTmm/view?usp=sharing.

[Figures 3-6: example visual comparisons of the phase 3 results.]

4 Teams and Methods

In this section, we briefly describe the participating teams and their proposed methods.

4.1 Team MiAlgo

Team MiAlgo proposed a Wavelet UNet with a Hybrid Transformer and CNN model optimized by adversarial training to tackle the real-world image restoration task.

4.1.1 Generator model

[Figure 7: architecture of Team MiAlgo's generator.]

As shown in Fig. 7, the model is based on MWRCAN [13]. It uses a UNet architecture that employs Haar wavelet transforms and inverse transforms for 2× downsampling and upsampling. The major convolution modules consist of $N$ Resblocks, where $N$ is 8 in this case. The channels of the Resblocks are marked in the diagram. There is also a residual connection in each downsample or upsample block; these connections are omitted in the diagram for clarity.

Self-attention in transformers enables the network to identify self-similar features throughout the entire image, thereby enhancing its semantic recognition capabilities. However, the attention structure becomes increasingly time-consuming as the feature size grows, rendering it impractical for high-resolution image restoration tasks. To strike a balance between performance and efficiency, the team integrated RESATT structures into the middle block of the UNet. RESATT comprises $N$ basic blocks, each consisting of a res-block followed by a single-head self-attention block.

The UNet produces a 3-channel image called out1. To enhance the quality of the restored image, they incorporate a refinement module based on the EMVD [25] approach, which helps to recover important details that may have been lost during restoration. The refinement module takes the LR image and out1 as inputs and produces a single-channel fusion weight, denoted by $\alpha$. The final output image is obtained by blending the LR image and out1 using $\alpha$, i.e., $HR = (1-\alpha)\cdot LR + \alpha\cdot out1$. The refinement module is lightweight, comprising only three convolutional layers with a maximum of 16 channels. Despite its simplicity, it is capable of capturing details that are crucial for the final output image.
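A minimal PyTorch sketch of such a lightweight blending module is given below; the layer count (three convolutions) and the 16-channel cap follow the description above, while the kernel sizes, activations, and the sigmoid used to bound $\alpha$ are our assumptions.

import torch
import torch.nn as nn

class RefineBlend(nn.Module):
    """Predicts a single-channel fusion weight alpha from (LR, out1) and blends them."""
    def __init__(self, mid_channels: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(6, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, lr: torch.Tensor, out1: torch.Tensor) -> torch.Tensor:
        alpha = self.body(torch.cat([lr, out1], dim=1))  # (B, 1, H, W), values in [0, 1]
        return (1 - alpha) * lr + alpha * out1           # HR = (1 - alpha) * LR + alpha * out1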

The team still insists on using GAN models for general restoration because they found that diffusion models can lead to unacceptable distortions in text and regular textures. The model has approximately 341 MB of parameters, occupies 7 GB of GPU memory, and takes 180 ms to infer a 512×512×3 image on a computer with an NVIDIA RTX 4090 GPU.

4.1.2 Image degradation

The official competition only provided 100 pairs of training data, as well as 200 images without ground truth in the validation/test phases. The team found that the degradation level of the provided 100 training pairs is only consistent with the 100 images in phase 2, which is relatively mild; the other images in phase 2 and phase 3 exhibit heavier blurring.

Based on this analysis, the team developed two GAN degradation models that introduce different levels of blurring. They enlarge the generator in [28] by doubling its channels and use it as the degradation model. The first model was trained with the ESRGAN [40] training method on the 100 provided training pairs, with the high-resolution images serving as input to the degradation GAN and the low-resolution images as ground truth. This model introduces a weak level of blurring.

For the second model, they fine-tuned the weak degradation model using the approach outlined in [28]. They trained this model in an unpaired manner, using 50 heavily blurred images from phase 2 as unpaired GT and 1000 high-resolution input images from similar scenes as unpaired input. This model introduces a heavier level of blurring than the first one. When using the second degradation model, they utilize a human segmentation model and a text segmentation model to segment out humans with heights below 300 pixels and text with heights below 50 pixels. These regions are then replaced with the degradation results of the first degradation model. This strategy helps to reduce the gap between the input and ground truth for small human figures and text, and the team found that this trick improves the fidelity of the results in these regions.
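A sketch of this segmentation-guided mixing of the two degradation outputs is shown below; the array names and the simple mask-compositing form are ours.

import numpy as np

def mix_degradations(strong_deg: np.ndarray, weak_deg: np.ndarray,
                     small_region_mask: np.ndarray) -> np.ndarray:
    """Replace small human/text regions of the strong degradation with the weak one.

    strong_deg, weak_deg: (H, W, 3) degraded versions of the same HQ image.
    small_region_mask:    (H, W) binary mask, 1 inside small human/text regions.
    """
    mask = small_region_mask[..., None].astype(strong_deg.dtype)
    return mask * weak_deg + (1.0 - mask) * strong_deg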

4.1.3 GAN training

The team has an internal ultra-high-definition dataset consisting of approximately 10,000 images. The main scenes include common animals and plants, Chinese and English text, as well as common urban and rural scenes, which cover the typical shooting scenarios of mobile phones. They used the two aforementioned degradation GAN models to degrade these images, resulting in a dataset of 20,000 training pairs.

To develop a high-quality image restoration model for the phase 2 quantitative measures, they trained a GAN model on the 10,000 training pairs produced by the first degradation model. The generator's learning rate was set to 1e-5, with a batch size of 24 and a patch size of 512. The team began training with only the L2 loss for ~10,000 iterations, then switched to $L2 + 1\times PerceptualLoss + 0.1\times GANLoss$ for an additional 140,000 iterations. They then fine-tuned the model for ~20,000 iterations with $L2 + 0.1\times PerceptualLoss + 0.01\times GANLoss + 4\times LPIPS$ and a lower learning rate of 1e-6 on the official training set (100 pairs) to achieve a slightly higher quantitative score. The discriminator setting is the same as in Real-ESRGAN [39].

For phase 3, the team continued fine-tuning the model for approximately 100,000 iterations using $loss = L2 + 0.1\times PerceptualLoss + 0.01\times GANLoss + 4\times LPIPS$, with a learning rate of 1e-5. They used a mixed dataset with 80% strong degradation and 20% weak degradation by adjusting the training file list ratio. Finally, they cropped each training image into 512×512 patches and selected the top 10 patches with the highest NIQE scores for each image. They continued fine-tuning the model on this subset with a learning rate of 1e-6 for ~50,000 iterations. Higher-NIQE patches generally have richer textures, and they found that fine-tuning on this subset resulted in better image details.
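The staged training schedule described above can be summarized as the following configuration sketch; the field names are ours, and only the numbers are taken from the description.

# Loss weights refer to L2 + w_p * PerceptualLoss + w_g * GANLoss (+ w_l * LPIPS).
MIALGO_TRAINING_STAGES = [
    dict(stage="phase2_warmup",       iters=10_000,  lr=1e-5,
         loss=dict(l2=1.0)),
    dict(stage="phase2_main",         iters=140_000, lr=1e-5,
         loss=dict(l2=1.0, perceptual=1.0, gan=0.1)),
    dict(stage="phase2_finetune",     iters=20_000,  lr=1e-6,   # on the 100 official pairs
         loss=dict(l2=1.0, perceptual=0.1, gan=0.01, lpips=4.0)),
    dict(stage="phase3_finetune",     iters=100_000, lr=1e-5,   # 80% strong / 20% weak degradation
         loss=dict(l2=1.0, perceptual=0.1, gan=0.01, lpips=4.0)),
    dict(stage="phase3_niqe_patches", iters=50_000,  lr=1e-6,   # top 10 NIQE patches per image
         loss=dict(l2=1.0, perceptual=0.1, gan=0.01, lpips=4.0)),
]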

4.2 Team Xhs-IAG

Team Xhs-IAG proposed a method combining SUPIR and DeSRA, which achieves good generative performance and, at the same time, acceptable stability in terms of fidelity.

4.2.1 Detailed Method Description for Phase 2

The team used SRFormer [59] as the backbone in phase 2; the specific parameters are shown below.

window_size = 32,
embed_dim=180,
depths=(6, 6, 6, 6, 6, 6),
num_heads=(6, 6, 6, 6, 6, 6),
mlp_ratio=4.,

The dataset they used is LSDIR [17]. During training, they construct pairs with a resolution of 128×128. The degradation hyperparameters are the same as those of Real-ESRGAN. They trained for 92k iterations with a batch size of 12 (3 per GPU, 4 GPUs in total) in stage 1, and the Adam learning rate is 1e-4.

In the second stage of training, the team added an adversarial loss and a perceptual loss, and instead of using LSDIR, they only used the 100 paired images provided by the official competition. The results show that the degradation distribution of the official evaluation data is close to that of the 100 images. The specific loss function coefficients are shown below. They trained for a total of 140k iterations in the second stage, with a batch size of 12 and an Adam learning rate of 5e-5.

discriminator=dict(
    type='UNetDiscriminatorWithSpectralNorm',
    in_channels=3,
    mid_channels=64,
    skip_connection=True),
pixel_loss=dict(type='L1Loss', loss_weight=1.0, reduction='mean'),
perceptual_loss=dict(
    type='PerceptualLoss',
    layer_weights={
        '2': 0.1,
        '7': 0.1,
        '16': 1.0,
        '25': 1.0,
        '34': 1.0,
    },
    vgg_type='vgg19',
    perceptual_weight=1.0,
    style_weight=0,
    norm_img=False),
gan_loss=dict(
    type='GANLoss',
    gan_type='vanilla',
    loss_weight=5e-2,
    real_label_val=1.0,
    fake_label_val=0),

There is nothing special at test time: each image is simply fed directly into the trained model.

[Figure 8: overall pipeline of Team Xhs-IAG, combining SUPIR with a fusion module.]

4.2.2 Overall Approach

In recent years, diffusion methods have achieved remarkable results in the field of image generation, and many works have explored their application to image restoration. Since the data of phase 3 of this competition was unavailable beforehand, its degradation distribution may differ from that of phase 2. To increase the generalization ability of their solution, the team uses SUPIR [49] as the baseline model.

SUPIR is trained on 20 million images and models the distribution of natural images well. It supports multiple parameters, such as the positive prompt, the negative prompt, and the classifier-free guidance scale, to adjust the enhanced results. Due to the short competition time and the lack of open-source training code for SUPIR, they did not fine-tune SUPIR but instead built on its RGB results. To obtain the preliminary RGB results, most official default configurations were left unchanged; only the parameters listed in Table 2 differ from the defaults.

Although the results generated by diffusion can be natural in most scenes, fidelity issues may arise in scenes with small textures, such as text, patterns, and architectural lines. Especially in the field of photography, such distortion may be unacceptable to professionals, and even worse than leaving the image unprocessed. To alleviate this issue, as shown in Figure 8, they perform a further fusion step on top of the SUPIR results to obtain the final result. The input of the fusion module includes the SUPIR result, the original image, and a 0/1 mask. To obtain this 0/1 mask, they used the DeSRA [46] method. For the sake of fidelity, the fusion module performs a lighter enhancement on the areas with a value of 1 in the mask (e.g., using GAN-based methods), while the areas with a value of 0 are kept as unchanged as possible (i.e., using SUPIR's result). The fusion module and the DeSRA method are introduced in detail in Sections 4.2.3 and 4.2.6, respectively.

Table 2. SUPIR configurations that differ from the defaults.

  • positive prompt. Default: "Cinematic, High Contrast, highly detailed, taken using a Canon EOS R camera, hyperdetailed photo-realistic maximum detail, 32k, Color Grading, ultra HD, extreme meticulous detailing, skin pore detailing, hyper sharpness, perfect without deformations." Ours: "Cinematic, High Contrast, highly detailed, taken using a Canon EOS R camera, hyperdetailed photo-realistic maximum detail, 32k, Color Grading, ultra HD, extremely meticulous detailing, skin pore detailing, hyper sharpness, perfect without deformations, window glass is very clean".

  • edm_steps. Default: 50. Ours: 100.

  • sdxl_ckpt. Default: sd_xl_base_1.0_0.9vae. Ours: Juggernaut-XL_v9_RunDiffusionPhoto_v2.

  • s_cfg. Default: 4.0. Ours: 2.0.

4.2.3 Fusion Network

4.2.4 Architecture

To ensure the fidelity of the results from the diffusion-based model, the fusion module performs a refinement guided by a binary mask. Specifically, the model takes in three components during inference: the output from SUPIR, the original image, and a binary mask. Areas where the mask is zero indicate that the results from SUPIR are already satisfactory and do not necessitate any modifications, so these areas are kept. Conversely, regions where the mask is one suggest that the results require re-generation to maintain fidelity; these areas are replaced with the corresponding LR content before being fed into the model.

In light of the above, the fusion module operates akin to an image inpainting task [53, 54], with the key difference that the masked areas are not entirely devoid of information; instead, they contain low-quality content awaiting enhancement. In the training process, the team continues to follow the Real-ESRGAN strategy to generate (LR, GT) pairs on the LSDIR dataset. As illustrated in Figure 9, the model backbone continues to employ SRFormer [59] (consistent with phase 2), with the only change being the inputs, which now encompass the LR image, the mask, and the GT/LR combination derived from the mask. In the inference process, the GT depicted in Figure 9 is substituted with the output yielded by SUPIR.
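A sketch of how the fusion input could be composed at inference time, following the description above (the tensor names and the concatenation order are assumptions):

import torch

def compose_fusion_input(supir_out: torch.Tensor, lr: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """supir_out, lr: (B, 3, H, W); mask: (B, 1, H, W), 1 = re-generate for fidelity.

    Where mask == 1 the SUPIR result is replaced by the LR content; where
    mask == 0 the SUPIR result is kept. The composite is concatenated with
    the LR image and the mask and fed to the SRFormer-based fusion network.
    """
    composite = mask * lr + (1.0 - mask) * supir_out
    return torch.cat([lr, mask, composite], dim=1)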

For the mask used during training, they generate it randomly following the method outlined in STTN[53], while during testing, they utilize the DeSRA[46] approach to obtain the mask. Regarding the DeSRA method, it will be introduced later.

[Figure 9: inputs of the fusion network during training and inference.]

Given that the inputs already contain regions of high quality, the loss function must be modified accordingly. Similar to image inpainting tasks [54], the loss function encompasses a hole loss, a valid loss, a perceptual loss, and an adversarial loss. Notably, for the generated fake image, the discriminator employs soft labels when calculating the least-squares loss [54], rather than the hard labels 0 and 1. This design allows the discriminator to better discern potential mask areas.

4.2.5 Training Details

The team used SRFormer [59] as the backbone; the specific parameters are shown in the following code.

window_size = 24,
embed_dim = 360,
depths=(6, 6, 6, 6, 6),
num_heads=(6, 6, 6, 6, 6),
mlp_ratio=3

The dataset they used is LSDIR [17]. During training, the team constructed pairs with a resolution of 144×144. The degradation hyperparameters are the same as those of Real-ESRGAN. They trained for 172k iterations with a batch size of 8 (2 per GPU, 4 GPUs in total), and the Adam learning rate is 1e-4. The specific loss function coefficients are shown below.

valid_loss = dict(type='Valid_loss', loss_weight=0.3),
hole_loss = dict(type='Hole_loss', loss_weight=0.01),
perceptual_loss=dict(
    type='PerceptualLoss',
    vgg_type='vgg19',
    layer_weights={
        '1': 1.,
        '6': 1.,
        '11': 1.,
        '20': 1.,
        '29': 1.,
    },
    layer_weights_style={
        '8': 1.,
        '17': 1.,
        '26': 1.,
        '31': 1.,
    },
    perceptual_weight=0.2,
    style_weight=150,
    norm_img=False,
),
gan_loss=dict(
    type='GANLoss',
    gan_type='lsgan',
    loss_weight=0.02,
    real_label_val=1.0,
    fake_label_val=0)

The generation of random masks during training follows the corresponding implementation in the official STTN GitHub repository.

4.2.6 DeSRA Method

With the fusion model in place, it is necessary to ascertain the masks used during testing, i.e., to identify the regions where the diffusion results are distorted. A straightforward method involves manual annotation of masks, but this approach is not only unfair in the context of a competition but also labor-intensive.

The team employs the methodology from DeSRA [46] for identifying GAN artifacts, utilizing a combination of structural similarity metrics and semantic segmentation results to generate the masks. To be precise, they obtain the mask by contrasting the outputs of the GAN model with those of the diffusion model. The GAN model utilized in this process is the one that was fully trained during phase 2. This choice is motivated by the fact that, despite the GAN model's potential shortcomings in visual quality, it excels at preserving fidelity in intricate details such as text and textures. By adjusting the parameters, the team strives to align the distribution of the masks with human visual perception. It is important to note that no special parameters are used for any individual image; the same set of parameters is applied consistently across all 50 images.

To enhance the accuracy of the segmentation, they utilized the Mask2Former model [10] for this task; compared to the SegFormer model [45] used in the original DeSRA, Mask2Former represents a more advanced approach. Within the provided code, they include scripts for mask generation, which encompass all the parameters used, including the weights for semantic categories, contrast_threshold, area_threshold, and so on.

4.3 Team So Elegant

The team proposed a Consistency-guided Stable Diffusion method for Image Restoration.

[Figure 10: the Consistency Guided Stable Diffusion (CGSD) framework of Team So Elegant.]

As shown in Figure 10, the proposed Consistency Guided Stable Diffusion (CGSD) model has three primary stages. Stage 1 is based on the CNN-based restoration model DiffIR [44] and removes diversified degradations. DiffIR uses the powerful mapping capability of the diffusion model to estimate a compact IR prior representation (IPR) to guide image restoration, thereby improving the efficiency and stability of the diffusion model in image restoration. To bridge the domain gap, the degradation of the given data is used to customize the degradation distribution for training [57], which improves the performance on the target test images while maintaining generalization. Additionally, BSRGAN [55] is used to simulate image degradation to generate paired training data, and virtual focus blur is added to BSRGAN to better suit the target test images. In stage 2, Stable Diffusion (SD) [31] is leveraged to refine textures and details. To improve the fidelity of the SD restoration, a Consistency-Guided Sampling (CGS) module is proposed to constrain the generation. Specifically, the CGS module takes the recovered image of stage 1 as the consistency guidance in each decoding step and aligns the recovery result of each step with it:

$x_{t-1} \leftarrow x_{t-1} + \sigma_t (x_{s1} - x_{t-1})$   (3)

where $x_{t-1}$ and $x_{s1}$ correspond to the noise-free predicted output at step $t-1$ and the latent of the recovered image $I_{s1}$, respectively, and $\sigma_t$ represents the weight of the guidance. The image structure is determined in the early diffusion steps, while the later steps mainly generate high-frequency details. The final stage 3 is proposed to address the contextual distortion caused by the diffusion model. The contextual information from $I_{s1}$ guides the refined image $I_{s2}$. Similar to [15], deformable convolution [12] is employed to warp the details in $I_{s2}$ to match the fidelity of $I_{s1}$. A problematic mask $M$ [46], located via the relative local variance distance between $I_{s1}$ and $I_{s2}$ and semantic-aware thresholds, is used as an additional condition. The method is implemented in PyTorch and trained on 8 NVIDIA V100 GPUs. For stage 1, the team first uses the original configuration of DiffIR for training, and then adjusts the learning rate to 5e-5 and the batch size to 2, training for 10K iterations at a resolution of 512×512. For stage 2, they train the SD model using the AdamW [22] optimizer with a learning rate of 1e-4 and a batch size of 64 for 50K steps. For stage 3, they use a batch size of 2 and a patch size of 1024×1024, with Adam as the optimizer and a learning rate of 1e-4, and train the model for 20K iterations.
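A one-line sketch of the consistency-guided sampling update in Eq. (3), applied to the noise-free prediction at every decoding step (the variable names are ours):

import torch

def cgs_step(x_prev: torch.Tensor, x_s1_latent: torch.Tensor, sigma_t: float) -> torch.Tensor:
    # Pull the noise-free prediction x_{t-1} towards the latent of the stage 1 result.
    return x_prev + sigma_t * (x_s1_latent - x_prev)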

4.4 Team IIP_IR

[Figure 11: Team IIP_IR's DAIR framework for phase 2 and the refinement pipeline for phase 3.]

The team IIP_IR introduced an integrated framework called Degradation-Aware Image Restoration (DAIR), based on the FFTformer architecture introduced in [14], for phase 2. DAIR comprises three main components: Degradation Kernel Estimation (DKE), Degradation Representation Injection (DRI), and FFTformer. The team's approach, as illustrated in Figure 11, has the potential to enhance existing models and improve overall performance.

To enable the model to handle the degradations of the images, they utilize a method that learns per-pixel degradation convolution kernels similar to blur kernels, which can reconstruct the LQ image when convolved with the HQ image. Unlike a blur kernel, the kernel estimated by DKE is not constrained to have positive weights that sum to one, thus learning a richer degradation representation.

To maximize the retention of degradation information for the restoration model, the kernels estimated by DKE are embedded into a spatially adaptive representation and injected into the U-Net architecture through a SPADE module [29]. The SPADE module does not change the network structure, so DKE and DRI can be applied directly to any U-Net-based image restoration model.

In the training process, the team uses the method in [6] to generate paired data for pre-training the model, which improves its generalization ability and adaptability, and finally fine-tunes the model on the 100 provided pairs. Since networks trained with only the L1 loss usually produce smooth/blurry results, they apply perceptual loss and GAN loss constraints to the reconstructed LQ and HQ images in both the pre-training and fine-tuning phases to increase the realism of the images.

Figure 11 also illustrates the pipeline for phase 3. The team utilizes this pipeline to refine the details of the pre-processed images from phase 2. The images first undergo 2× upscaling using HAT [8] to enrich the textures. This initial upscaling effectively mitigates distortions of small-scale details, such as text, during the texture generation process that leverages pre-trained diffusion priors. They then employ StableSR [36] with SD-Turbo to further refine the upscaled images, producing realistic textures in regions with severe degradations. The refined images are finally downscaled with Lanczos interpolation to obtain the output.

4.5 Team DACLIP-IR

Team DACLIP-IR proposed a photo-realistic image restoration method with enriched vision-language features.

[Figure 12: the LQ image generation pipeline with index shuffling used by Team DACLIP-IR.]

The model is built upon IR-SDE [24] and DACLIP-UIR [23]. Since no training datasets are provided in this challenge, the team chooses to generate LQ images using a pipeline similar to Real-ESRGAN [41] but with an index-shuffling strategy, as shown in Figure 12. Based on the synthetic dataset, they retrain DA-CLIP to enhance LQ features by minimizing an $\ell_1$ distance between LQ embeddings and HQ embeddings. Then they incorporate the enhanced LQ embeddings into IR-SDE with cross-attention to restore HQ images, similar to DA-CLIP [23]. In addition, they propose a posterior sampling approach for IR-SDE that improves both fidelity and perceptual performance. To further improve the generalization ability, they first train the model on the LSDIR dataset [17] and then fine-tune it on a mixed dataset with both synthetic and real-world images for phase two and phase three. Note that they use the same model for phase two and phase three, but take the original reverse-time SDE for phase three for better visual performance (small noise makes the photo look more realistic).
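A minimal sketch of the embedding-alignment objective, i.e., the $\ell_1$ distance between LQ and HQ embeddings; here image_encoder stands in for the DA-CLIP image encoder, and detaching the HQ branch so that only the LQ-side features are adapted is our assumption.

import torch
import torch.nn.functional as F

def embedding_alignment_loss(image_encoder, lq: torch.Tensor, hq: torch.Tensor) -> torch.Tensor:
    lq_emb = image_encoder(lq)
    with torch.no_grad():  # keep the HQ embeddings as a fixed target (assumption)
        hq_emb = image_encoder(hq)
    return F.l1_loss(lq_emb, hq_emb)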

Specific training details for phase two: the team adds the paired validation dataset of phase one to further fine-tune the model, which considerably improves all metrics.

Specific training details for phase three: they use the same model trained in phase two. To prevent the image from looking over-smoothed and oil-painted, they use the original reverse-time SDE during inference.

4.6 Team TongJi-IPOE

Team TongJi-IPOE proposed a DRBFormer-StableSR fusion Network for restoring any image model in the Wild.

Method. The overall architecture is shown in Figure 13. The proposed network consists of two parts: the DRBFormer image restoration network and the StableSR [37] image SR network. DRBFormer uses Restormer blocks as the backbone. Inspired by [51], a multi-scale dynamic residual block (DRB) is designed in the decoding network to better handle the varying blur [33]. Considering that diffusion priors can improve the quality of restored images, the network adopts the fusion in Eq. (4) for image restoration. Due to the randomness of the diffusion model, the generated image may deviate from the real content, so the adjustable coefficient $t$ was set to 0.9 in this competition.

$\hat{I} = t \cdot DRBFormer(I_{blur}) + (1-t) \cdot StableSR(I_{blur})$   (4)

where $\hat{I}$ is the restoration result, $t \in [0, 1]$ is the adjustable coefficient, and $I_{blur}$ is the blurry input image.
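The fusion of Eq. (4) reduces to a simple weighted blend of the two networks' outputs, e.g.:

import torch

def fuse_outputs(drbformer_out: torch.Tensor, stablesr_out: torch.Tensor,
                 t: float = 0.9) -> torch.Tensor:
    # Eq. (4): weighted blend of the DRBFormer and StableSR restorations (t = 0.9 in the competition).
    return t * drbformer_out + (1.0 - t) * stablesr_out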

[Figure 13: overall architecture of Team TongJi-IPOE's DRBFormer-StableSR fusion network.]

Training strategy. In total, four datasets are used, including DPDD [2], SIDD [1], GoPro [26] and NH-HAZE [4]. The training dataset is augmented with random cropping. The details of the training steps are as follows:
1. Pretraining on the combined datasets. Ground-truth patches of size 128×128 are randomly cropped from the ground-truth images, and the mini-batch size is set to 8. The model is trained by minimizing a weighted L1 loss and perceptual loss with the Adam optimizer. The initial learning rate is set to $3\times10^{-4}$ and the total number of iterations is 392k.
2. Finetuning on the combined datasets. To adapt the model to higher-resolution image processing, the images are cropped to 160×160, 192×192, 256×256, 320×320 and 384×384, with the mini-batch sizes set to [5, 3, 2, 1, 1], respectively. The model is trained by minimizing a weighted L1 loss and perceptual loss with the Adam optimizer. The initial learning rate is set to $3\times10^{-4}$ and adjusted by cosine annealing. The total number of iterations is 208k.

4.7 Team ImagePhoneix

[Figure 14: Team ImagePhoneix's DiffIR-based framework.]

Team ImagePhoneix adopted DiffIR [44] as the baseline network, as shown in Figure 14. They froze "stage 2" of DiffIR and fine-tuned its "stage 1" network on the provided LR-HR image pairs.

Implementation details.

With the provided image pairs, they first cropped them into sub-images of size 400×400 to accelerate I/O, resulting in a total of 2,500 sub-images. To fine-tune the pre-trained model, all the sub-images are cropped into image patches of size 256×256. They randomly flipped and rotated the input images for data augmentation. The Adam algorithm is adopted with $\beta_1 = 0.9$ and $\beta_2 = 0.99$ to update the model parameters. They set the initial learning rate and the total number of iterations to $1\times10^{-4}$ and $1\times10^{5}$, respectively. However, the encoder is updated with a different strategy: its parameters are updated for $2.5\times10^{4}$ iterations with an initial learning rate of $2\times10^{-4}$, which is decayed by a factor of 0.1 at the $1.5\times10^{4}$-th iteration. Different from the encoder, the learning rate of the image generator is decayed by a factor of 0.5 at the $8.0\times10^{4}$-th iteration.

In Phase II, the evaluation metric is a linear combination of reconstruction and perceptual measurements. To handle this, the team adopted a hybrid loss function to fine-tune the model, which involves the $L_1$ loss, a perceptual loss based on VGG features $L_{\text{vgg}}$, an adversarial loss $L_{\text{GAN}}$, and a Kullback-Leibler divergence $L_{\text{KL}}$. The total loss is defined as $\mathcal{L} = \lambda_1 L_1 + \lambda_2 L_{\text{vgg}} + \lambda_3 L_{\text{GAN}} + \lambda_4 L_{\text{KL}}$, where the $L_1$ loss measures the reconstruction error of the generated images, $L_{\text{vgg}}$ aims to improve the perceptual quality of the images, and $L_{\text{GAN}}$ and $L_{\text{KL}}$ measure the distribution distance between the generated images and the ground-truth images in the spatial and latent spaces, respectively. $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are hyper-parameters balancing the distortion and perceptual quality of images and are all set to 1.0 in this phase.

4.7.1 Phase III: Evaluation on Subjective Measurements

In Phase III, the team aims to improve the perceptual quality of the generated images. Instead of using the perceptual loss based on VGG features, they adopt the robust distribution loss [27], which minimizes the distribution distance between the generated images and the ground-truth images based on the fast Fourier transform (FFT). Given the generated image $x$ and the ground-truth image $y$, the robust distribution loss $L_{\text{freq}}$ is defined as follows:

$L_{\text{freq}}(x, y) = L_{\text{WD}}(\mathcal{A}_x, \mathcal{A}_y) + \lambda_{\text{phase}} L_{\text{WD}}(\mathcal{P}_x, \mathcal{P}_y)$,   (5)

where $\mathcal{A}_x = |\mathcal{F}(x)|$ and $\mathcal{A}_y = |\mathcal{F}(y)|$ denote the amplitude spectra of the images $x$ and $y$ obtained via the FFT $\mathcal{F}$, and $\mathcal{P}_x$ and $\mathcal{P}_y$ represent the phases of $\mathcal{F}(x)$ and $\mathcal{F}(y)$, respectively. $L_{\text{WD}}$ is the Wasserstein distance, and $\lambda_{\text{phase}}$ is a hyper-parameter set to 0.1 in the fine-tuning procedure.
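A sketch of this frequency-domain loss is shown below. The exact Wasserstein-distance formulation used by the team is not specified here, so the sketch uses the sorted-difference form of the empirical 1-D Wasserstein-1 distance over the flattened amplitude and phase values; $\lambda_{\text{phase}} = 0.1$ follows the text.

import torch

def wasserstein_1d(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Empirical 1-D Wasserstein-1 distance between the value distributions of a and b.
    return (torch.sort(a.flatten()).values - torch.sort(b.flatten()).values).abs().mean()

def freq_distribution_loss(x: torch.Tensor, y: torch.Tensor, lambda_phase: float = 0.1) -> torch.Tensor:
    fx, fy = torch.fft.fft2(x), torch.fft.fft2(y)        # FFT of generated and ground-truth images
    amp_term = wasserstein_1d(fx.abs(), fy.abs())        # amplitude spectra A_x, A_y
    phase_term = wasserstein_1d(fx.angle(), fy.angle())  # phase spectra P_x, P_y
    return amp_term + lambda_phase * phase_term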

4.8 Team HIT-IIL

The team HIT-IIL used the degradation process of Real-ESRGAN [41] and replaced the backbone with Restormer [52]. For phase 2, they only trained a Real-ESRGAN x1plus model with an additional LPIPS loss. For phase 3, they additionally trained a new x1 model with the Restormer backbone and averaged the results of the two models with weights of 0.8 and 0.2, respectively.

They use the DF2K (DIV2K and Flickr2K) datasets to train the model. For pre-processing, they use a multi-scale strategy, i.e., they downsample the HR images to obtain several ground-truth images at different scales. They then crop the DF2K images into sub-images for faster I/O and processing.

4.9 Team MARSHAL

4.9.1 Methods details

The team observed that the input images and evaluation criteria of the two phases are different. The input images in phase 2 have higher quality, and the evaluation for this phase is based on reference metrics. The input quality in phase 3 is relatively low, with more severe blur, and this phase uses manual scoring to select the images with better visual effects as the winners. Taking the existing solutions into account, the team decided to adopt a GAN-based approach in phase 2 to obtain higher objective scores, and a diffusion-based approach in phase 3 to make the results more visually appealing.

4.9.2 Phase 2

The organizers provided 100 pairs of training images whose input quality and imaging style are similar to those of the test set of this stage. Therefore, the team chose DiffIR [44] for this stage. It only uses the diffusion process to model the condition branch, and the main network is trained with the GAN loss, so it rarely destroys local details (such as text and small faces) and can obtain higher objective evaluation scores. They directly use the pre-trained model of DiffIR and fine-tune it on the paired dataset provided by the organizers, which allows them to quickly obtain a good result. The whole fine-tuning process runs for 6.6k iterations with a batch size of 48. In addition, when preparing the dataset, they adopted a multi-scale downsampling strategy, hoping that the model could gain knowledge at different scales. The downsampling scales are set to 0.75, 0.5, and 0.33, respectively.
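A sketch of the multi-scale downsampling used when preparing the fine-tuning data (the resampling filter is an assumption):

from PIL import Image

def multiscale_versions(img: Image.Image, scales=(0.75, 0.5, 0.33)):
    # Keep the original image and add downsampled copies at each scale.
    w, h = img.size
    return [img] + [img.resize((max(1, round(w * s)), max(1, round(h * s))), Image.BICUBIC)
                    for s in scales]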

4.9.3 Phase 3

In phase 3, the test set provided by the organizers has a significant domain gap from that of phase 2, and the degradation is more severe. The team believed they could not directly use the model that performs well in phase 2 to obtain good visual results in this stage, so they switched to methods based on pre-trained diffusion models [36, 48, 43, 34]. As shown in Fig. 15, the team chooses the popular ControlNet [56] as the solution. Following [19], they use the pre-trained VAE encoder as the image encoder. In terms of training data, they choose LSDIR [17], which contains tens of thousands of texture-rich images. As for data degradation, to match the more severe degradation of the test set, they choose Real-ESRGAN's [41] degradation pipeline to synthesize paired data. They train the model with a batch size of 32 for 100k iterations. In the inference stage, the team resizes the input to 2048 before feeding it into the model, which helps preserve small structures like text, as shown in Fig. 16. The team also adopts the LRE strategy proposed in [43] to improve fidelity. The pre-trained diffusion model in this solution is SD2-base [32].

[Figure 15: Team MARSHAL's ControlNet-based solution for phase 3.]
[Figure 16: effect of resizing the input to 2048 on preserving small structures such as text.]

5 Acknowledgments

This work was partially supported by the Humboldt Foundation. We thank the NTIRE 2024 sponsors: Meta Reality Labs, OPPO, KuaiShou, Huawei and University of Würzburg (Computer Vision Lab).

6 Appendix: Teams and affiliations

NTIRE 2024 Team

Challenge:

NTIRE 2024 Restore Any Image Model (RAIM) in the Wild

Organizers:

Jie Liang1 (liang27jie@gmail.com)

Qiaosi Yi1,2 (qiaosi.yi@connect.polyu.hk)

Shuaizheng Liu1,2 (shuaizhengliu21@gmail.com)

Lingchen Sun1,2 (ling-chen.sun@connect.polyu.hk)

Xindong Zhang1 (17901410r@connect.polyu.hk)

Hui Zeng1 (cshzeng@gmail.com)

Prof. Lei Zhang1,2 (cslzhang@comp.polyu.edu.hk)

Prof. Radu Timofte3 (radu.timofte@uni-wuerzburg.de)

Affiliations:

1 OPPO Research Institute

2 The Hong Kong Polytechnic University

3 Computer Vision Lab, University of Würzburg, Germany

Team MiAlgo

Members:

Yibin Huang (huangyibin@xiaomi.com)

Shuai Liu, Yongqiang Li, Chaoyu Feng, Xiaotao Wang, Lei Lei

Affiliations:

Xiaomi Inc., China

Team Xhs-IAG

Members:

Yuxiang Chen1 (chenyuxiang@xiaohongshu.com)

Xiangyu Chen2,3, Qiubo Chen1

Affiliations:

1 Xiaohongshu,

2 University of Macau,

3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Team So Elegant

Members:

Jiaxu Chen

Fengyu Sun (sunfengyu@s.upc.edu.cn), Mengying Cui

Affiliations:

China University of Petroleum (East China)

Team IIP_IR

Members:

Zhenyu Hu (zhenyuhu@whu.edu.cn),

Jingyun Liu, Wenzhuo Ma, Ce Wang, Hanyou Zheng, Wanjie Sun, Zhenzhong Chen

Affiliations:

School of Remote Sensing and Information Engineering, Wuhan University

Team DACLIP-IR

Members:

Ziwei Luo (ziwei.luo@it.uu.se)

Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön

Affiliations:

Department of Information Technology, Uppsala University

Team TongJi-IPOE

Members:

Xiong Dun

Pengzhou Ji (jipengzhoudrew@163.com), Yujie Xing, Xuquan Wang, Zhanshan Wang, Xinbin Cheng

Affiliations:

Institute of Precision Optical Engineering, School of Physics Science and Engineering, Tongji University

Team ImagePhoneix

Members:

Jun Xiao1 (jun.xiao@connect.polyu.hk)

Chenhang He1, Xiuyuan Wang1, Zhi-Song Liu2

Affiliations:

1 The Hong Kong Polytechnic University

2 Lappeenranta-Lahti University of Technology

Team HIT-IIL

Members:

Zimeng Miao (2214177602@qq.com)

Zhicun Yin, Ming Liu, Wangmeng Zuo

Affiliations:

School of Computer Science and Technology, Harbin Institute of Technology

Team MARSHAL

Members:

Rongyuan Wu (rong-yuan.wu@connect.polyu.hk)

Shuai Li

Affiliations:

The Hong Kong Polytechnic University

References

  • Abdelhamed et al. [2018] Abdelrahman Abdelhamed, Stephen Lin, and Michael S. Brown. A high-quality denoising dataset for smartphone cameras. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1692–1700, 2018.
  • Abuolaim and Brown [2020] Abdullah Abuolaim and M. S. Brown. Defocus deblurring using dual-pixel data. In European Conference on Computer Vision, 2020.
  • Ancuti et al. [2024] Cosmin Ancuti, Codruta O. Ancuti, Florin-Alexandru Vasluianu, Radu Timofte, et al. NTIRE 2024 dense and non-homogeneous dehazing challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Ancuti et al. [2020] Codruta O. Ancuti, Cosmin Ancuti, and Radu Timofte. NH-Haze: An image dehazing benchmark with non-homogeneous hazy and haze-free images. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1798–1805, 2020.
  • Banić et al. [2024] Nikola Banić, Egor Ershov, Artyom Panshin, Oleg Karasev, Sergey Korchagin, Shepelev Lev, Alexandr Startsev, Daniil Vladimirov, Ekaterina Zaychenkova, Dmitrii R. Iarchuk, Maria Efimova, Radu Timofte, Arseniy Terekhin, et al. NTIRE 2024 challenge on night photography rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Carbajal et al. [2021] Guillermo Carbajal, Patricia Vitoria, Mauricio Delbracio, Pablo Musé, and José Lezama. Non-uniform blur kernel estimation via adaptive basis decomposition. arXiv preprint arXiv:2102.01026, 2021.
  • Chahine et al. [2024] Nicolas Chahine, Marcos V. Conde, Sira Ferradans, Radu Timofte, et al. Deep portrait quality assessment. A NTIRE 2024 challenge survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Chen et al. [2023] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22367–22377, 2023.
  • Chen et al. [2024] Zheng Chen, Zongwei Wu, Eduard Sebastian Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, et al. NTIRE 2024 challenge on image super-resolution (×4): Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Cheng et al. [2022] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
  • Conde et al. [2024] Marcos V. Conde, Florin-Alexandru Vasluianu, Radu Timofte, et al. Deep raw image super-resolution. A NTIRE 2024 challenge survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Dai et al. [2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
  • Ignatov et al. [2020] Andrey Ignatov, Radu Timofte, Zhilu Zhang, Ming Liu, Haolin Wang, Wangmeng Zuo, Jiawei Zhang, Ruimao Zhang, Zhanglin Peng, Sijie Ren, et al. AIM 2020 challenge on learned image signal processing pipeline. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 152–170. Springer, 2020.
  • Kong et al. [2023] Lingshun Kong, Jiangxin Dong, Jianjun Ge, Mingqiang Li, and Jinshan Pan. Efficient frequency domain-based transformers for high-quality image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5886–5895, 2023.
  • Li et al. [2020] Xiaoming Li, Chaofeng Chen, Shangchen Zhou, Xianhui Lin, Wangmeng Zuo, and Lei Zhang. Blind face restoration via deep multi-scale component dictionaries. In European Conference on Computer Vision, pages 399–415. Springer, 2020.
  • Li et al. [2024] Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, et al. NTIRE 2024 challenge on short-form UGC video quality assessment: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Li et al. [2023] Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Demandolx, et al. LSDIR: A large scale dataset for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1775–1787, 2023.
  • Liang et al. [2024] Jie Liang, Qiaosi Yi, Shuaizheng Liu, Lingchen Sun, Rongyuan Wu, Xindong Zhang, Hui Zeng, Radu Timofte, Lei Zhang, et al. NTIRE 2024 restore any image model (RAIM) in the wild challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Lin et al. [2023] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Ben Fei, Bo Dai, Wanli Ouyang, Yu Qiao, and Chao Dong. DiffBIR: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070, 2023.
  • Liu et al. [2024a] Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte, et al. NTIRE 2024 quality assessment of AI-generated content challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024a.
  • Liu et al. [2024b] Xiaoning Liu, Zongwei Wu, Ao Li, Florin-Alexandru Vasluianu, Yulun Zhang, Shuhang Gu, Le Zhang, Ce Zhu, Radu Timofte, et al. NTIRE 2024 challenge on low light image enhancement: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024b.
  • Loshchilov and Hutter [2018] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. 2018.
  • Luo et al. [2023a] Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B. Schön. Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018, 2023a.
  • Luo et al. [2023b] Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B. Schön. Image restoration with mean-reverting stochastic differential equations. arXiv preprint arXiv:2301.11699, 2023b.
  • Maggioni et al. [2021] Matteo Maggioni, Yibin Huang, Cheng Li, Shuai Xiao, Zhongqian Fu, and Fenglong Song. Efficient multi-stage video denoising with recurrent spatio-temporal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3466–3475, 2021.
  • Nah et al. [2016] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 257–265, 2016.
  • Ni et al. [2024] Zhangkai Ni, Juncheng Wu, Zian Wang, Wenhan Yang, Hanli Wang, and Lin Ma. Misalignment-robust frequency distribution loss for image transformation. arXiv preprint arXiv:2402.18192, 2024.
  • Ning et al. [2022] Qian Ning, Jingzhu Tang, Fangfang Wu, Weisheng Dong, Xin Li, and Guangming Shi. Learning degradation uncertainty for unsupervised real-world image super-resolution. In IJCAI, pages 1261–1267, 2022.
  • Park et al. [2019] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
  • Ren et al. [2024] Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, et al. The ninth NTIRE 2024 efficient super-resolution challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Rombach et al. [2022a] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022a.
  • Rombach et al. [2022b] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022b.
  • Ruan et al. [2022] Lingyan Ruan, Bin Chen, Jizhou Li, and Miuling Lam. Learning to deblur using light field generated and real defocus images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16283–16292, 2022.
  • Sun et al. [2023] Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hongwei Yong, and Lei Zhang. Improving the stability of diffusion models for content consistent super-resolution. arXiv preprint arXiv:2401.00877, 2023.
  • Vasluianu et al. [2024] Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Zongwei Wu, Cailian Chen, Radu Timofte, et al. NTIRE 2024 image shadow removal challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Wang et al. [2023a] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023a.
  • Wang et al. [2023b] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023b.
  • Wang et al. [2024a] Longguang Wang, Yulan Guo, Juncheng Li, Hongda Liu, Yang Zhao, Yingqian Wang, Zhi Jin, Shuhang Gu, Radu Timofte, et al. NTIRE 2024 challenge on stereo image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024a.
  • [39] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In International Conference on Computer Vision Workshops (ICCVW).
  • Wang et al. [2018] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In The European Conference on Computer Vision Workshops (ECCVW), 2018.
  • Wang et al. [2021] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1905–1914, 2021.
  • Wang et al. [2024b] Yingqian Wang, Zhengyu Liang, Qianyu Chen, Longguang Wang, Jungang Yang, Radu Timofte, Yulan Guo, et al. NTIRE 2024 challenge on light field image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024b.
  • Wu et al. [2023] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. SeeSR: Towards semantics-aware real-world image super-resolution. arXiv preprint arXiv:2311.16518, 2023.
  • Xia et al. [2023] Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, and Luc Van Gool. DiffIR: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13095–13105, 2023.
  • Xie et al. [2021] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
  • Xie et al. [2023] Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, and Chao Dong. DeSRA: Detect and delete the artifacts of GAN-based real-world super-resolution models. arXiv preprint arXiv:2307.02457, 2023.
  • Yang et al. [2024] Ren Yang, Radu Timofte, et al. NTIRE 2024 challenge on blind enhancement of compressed image: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Yang et al. [2023] Tao Yang, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. arXiv preprint arXiv:2308.14469, 2023.
  • Yu et al. [2024] Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. arXiv preprint arXiv:2401.13627, 2024.
  • Zama Ramirez et al. [2024] Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, Alex Costanzino, Matteo Poggi, et al. NTIRE 2024 challenge on HR depth from images of specular and transparent surfaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Zamir et al. [2021] Syed Waqas Zamir, Aditya Arora, Salman Hameed Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5718–5729, 2021.
  • Zamir et al. [2022] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5728–5739, 2022.
  • Zeng et al. [2020] Yanhong Zeng, Jianlong Fu, and Hongyang Chao. Learning joint spatial-temporal transformations for video inpainting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 528–543. Springer, 2020.
  • Zeng et al. [2022] Yanhong Zeng, Jianlong Fu, Hongyang Chao, and Baining Guo. Aggregated contextual transformations for high-resolution image inpainting. IEEE Transactions on Visualization and Computer Graphics, 2022.
  • Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021.
  • Zhang et al. [2023a] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023a.
  • Zhang et al. [2023b] Ruofan Zhang, Jinjin Gu, Haoyu Chen, Chao Dong, Yulun Zhang, and Wenming Yang. Crafting training degradation distribution for the accuracy-generalization trade-off in real-world super-resolution. In International Conference on Machine Learning, pages 41078–41091. PMLR, 2023b.
  • Zhang et al. [2024] Zhilu Zhang, Shuohao Zhang, Renlong Wu, Wangmeng Zuo, Radu Timofte, et al. NTIRE 2024 challenge on bracketing image restoration and enhancement: Datasets, methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.
  • Zhou et al. [2023] Yupeng Zhou, Zhen Li, Chun-Le Guo, Song Bai, Ming-Ming Cheng, and Qibin Hou. SRFormer: Permuted self-attention for single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12780–12791, 2023.