Single Image Super-Resolution using Hybrid Attention Transformer For CS 7150

An Analysis of "Activating More Pixels in Image Super-Resolution Transformer", CVPR 2023

Paper Overview

Problem: Image Super Resolution

Single image super-resolution is a classic and challenging problem in computer vision. It has practical applications in various domains, including medical imaging, satellite imagery, and enhancing the quality of digital photographs.
We input a low-resolution image and ask the network to generate a higher-resolution image.
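For contrast with learned super-resolution, the "traditional interpolated" baseline can be produced in a few lines with Pillow; the filenames and scale factor below are placeholders, not part of the paper.

```python
from PIL import Image

# Classical bicubic upscaling: the baseline that SR networks must beat.
scale = 4
lr = Image.open("input_lr.png")  # hypothetical low-resolution input
hr_bicubic = lr.resize((lr.width * scale, lr.height * scale), Image.BICUBIC)
hr_bicubic.save("output_bicubic_x4.png")
```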

Figure: Low-Resolution Image, Traditional Interpolated Image, and AI Super-Resolution Image.

Solution: Hybrid Attention Transformer (HAT)

CNN-based methods long dominated the image super-resolution (SR) field. Recently, Transformers have attracted attention, and SwinIR, a Transformer-based model, achieved a breakthrough. Despite this success, why Transformers outperform CNNs remains a mystery. An intuitive explanation is that Transformers benefit from the self-attention mechanism and can utilize long-range information. Interestingly, the authors find that SwinIR does NOT exploit more input pixels than CNN-based methods. The proposed Hybrid Attention Transformer (HAT) achieves higher pixel activation, as shown below.



Novel Contributions
  • The authors design a novel Hybrid Attention Transformer (HAT) that combines self-attention, channel attention, and a new overlapping cross-attention to activate more pixels for better reconstruction (see the channel-attention sketch after this list)
  • The authors propose an effective same-task pre-training strategy to further exploit the potential of the SR Transformer, and show the importance of large-scale data pre-training for the task
  • The method achieves state-of-the-art performance
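To make the channel-attention ingredient concrete, here is a minimal PyTorch sketch of a channel attention block in the spirit of HAT's CAB (a conv-GELU-conv body followed by squeeze-and-excitation style channel weighting). Class names, layer widths, and the reduction factor are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global pool: (B, C, H, W) -> (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),  # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # excite
            nn.Sigmoid(),                                   # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.attention(x)                        # reweight channels globally

class CAB(nn.Module):
    """Conv body + channel attention, run alongside window self-attention in HAT."""
    def __init__(self, channels: int, compress: int = 3, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // compress, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels // compress, channels, 3, padding=1),
            ChannelAttention(channels, reduction),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)
```

Because the channel weights are computed from a global average pool, every spatial position in the input contributes to them; this global statistic is one mechanism by which HAT activates more pixels than window self-attention alone.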


Results

Literature Review

Evolution of Image Super Resolution: The journey of image super-resolution (ISR) has evolved from basic image enhancement techniques to the use of advanced deep learning models. Particularly, Convolutional Neural Networks (CNNs) have played a pivotal role in addressing resolution limitations across various fields.

Transformers in Vision: Building on their success in natural language processing, Transformers have been adopted in computer vision. This transition brought significant improvements in capturing long-range dependencies and processing global information in ISR tasks. In 2021, SwinIR, a landmark Transformer-based network, was introduced and achieved breakthrough improvements in SR.

Introduction of Local Attribution Maps (LAM): LAM, introduced in 2021, performs attribution analysis of SR networks, aiming to find the input pixels that strongly influence the SR results. The LAM paper concluded that SR networks with a wider range of involved input pixels can achieve better performance.

Figure: comparison of the SR results and LAM attribution results of different SR networks. The LAM results visualize the importance of different input pixels w.r.t. the SR results.
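LAM itself is built on integrated gradients computed along a progressive blurring path; the sketch below shows only the core idea, attributing a chosen SR output patch to input pixels via plain input gradients. The sr_model handle and the patch-slicing convention are assumptions made for illustration.

```python
import torch

def gradient_attribution(sr_model, lr_image, patch_slice):
    """Saliency of input pixels for one SR output patch (simplified, not full LAM).

    sr_model:    any differentiable SR network (hypothetical handle)
    lr_image:    (1, 3, H, W) low-resolution input tensor
    patch_slice: e.g. (slice(100, 132), slice(100, 132)) selecting the target patch
    """
    sr_model.eval()
    lr = lr_image.clone().requires_grad_(True)
    sr = sr_model(lr)
    # Summing the patch lets one backward pass give d(patch) / d(input pixels)
    sr[..., patch_slice[0], patch_slice[1]].sum().backward()
    # Aggregate gradient magnitude over color channels -> (H, W) attribution map
    return lr.grad.abs().sum(dim=1).squeeze(0)
```

Pixels with large values in the returned map are the ones the network actually consulted when reconstructing the patch, which is the quantity the HAT authors use to argue that SwinIR activates fewer pixels than expected.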

Development of HAT: This paper introduces the Hybrid Attention Transformer (HAT), a novel approach that synergizes channel and self-attention mechanisms. Through the use of LAM, this development aims to overcome previous limitations of Transformers in ISR, enabling optimized utilization of input information for enhanced image reconstruction.

Biography

Xiangyu Chen

Xiangyu Chen, currently a joint Ph.D. student at the University of Macau and Shenzhen Institute of Advanced Technology, specializes in computer vision and computational photography with a focus on image super-resolution and general image restoration.

Xintao Wang

Xintao Wang, a senior staff researcher at Tencent ARC Lab, leads efforts in visual content generation. He completed his Ph.D. at the Chinese University of Hong Kong and has made significant contributions to the field of image and video generation/editing.

Jiantao Zhou

Jiantao Zhou is a professor at the University of Macau. He received his Ph.D. in ECE from the Hong Kong University of Science and Technology.

Yu Qiao

Yu Qiao is a professor at the Shanghai AI Laboratory and the Shenzhen Institutes of Advanced Technology. He received his Ph.D. in Mechanical Engineering from MIT.

Chao Dong

Chao Dong is a full professor at the Shenzhen Institutes of Advanced Technology. He received his Ph.D. from The Chinese University of Hong Kong.

Exploring the common affiliations among the authors:

  • Xiangyu Chen and Jiantao Zhou are with State Key Laboratory of Internet of Things for Smart City, University of Macau.
  • Xiangyu Chen, Xiangtao Kong, Chao Dong and Yu Qiao are with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  • Xiangyu Chen, Wenlong Zhang, Xiangtao Kong, Chao Dong and Yu Qiao are with Shanghai Artificial Intelligence Laboratory, Shanghai, China
  • Xintao Wang is with the Applied Research Center, Tencent PCG, Shenzhen, China

Social Impact

Since the HAT approach advances the state of the art in image super-resolution, it is a front-runner candidate for use in security and surveillance products. Because deep learning models often hallucinate additional details in the generated image, extra care should be taken to mitigate model biases.
Similarly, in medicine, care should be taken to make super-resolution models more reliable and to use them only where hallucination is acceptable.

Industry Applications

HAT promises SOTA performance and has established itself as an industry-ready model.
This paper was supported by Tencent ARC Lab, an active player in the super-resolution field. Their use cases for the technology lie in their cloud and social-media services: cloud services benefit from super-resolution because it can upscale low-resolution images in real time, saving the storage space that high-resolution counterparts would otherwise occupy.
Hence HAT provides companies like Tencent with an improved approach for implementing super-resolution in their products and services.

Follow-on Research

This paper introduces the idea of activating more pixels to get better image reconstructions. Further research could be done to take this idea forward and produce even better-looking image reconstructions.
A very recent paper by D. Zhang et al., "SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution", aims to do this. They propose SwinFIR and HATFIR by replacing convolution components with Fast Fourier Convolution (FFC), which has an image-wide receptive field, further improving the efficiency of capturing global information.

Figure: (a) Residual Hybrid Attention Group (RHAG) in HAT. (b) Swin Fourier Transformer Block (SFTB) in HATFIR. The 3x3 convolution in HAT's RHAG is replaced with the SFB to obtain the SFTB.
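For intuition about why FFC yields an image-wide receptive field, below is a minimal sketch of a Fourier unit, the core of FFC: a pointwise convolution applied in the frequency domain, where every frequency coefficient depends on all spatial positions. The layer sizes and normalization choice are our assumptions, not the SwinFIR implementation.

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """Pointwise convolution in the frequency domain (illustrative FFC core)."""
    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")           # (B, C, H, W//2+1), complex
        freq = torch.cat([freq.real, freq.imag], dim=1)   # (B, 2C, H, W//2+1), real
        freq = self.relu(self.conv(freq))                 # mix channels per frequency bin
        real, imag = freq.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        # Every output pixel of the inverse FFT depends on every frequency bin,
        # hence on every input pixel: an image-wide receptive field in one layer.
        return torch.fft.irfft2(freq, s=(h, w), norm="ortho")
```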

Quantitative comparison with HAT

Peer Review

Reviewer 1 - Wenyu Zhang

Rating: 7/10 (would accept)

Strengths: Demonstrates significant improvement over existing methods, with extensive experimental validation.

Weaknesses: Future work could benefit from a discussion on model complexity and a broader comparative analysis.

Overall: Offers a substantial contribution to Transformer applications in super-resolution, meriting publication with minor revisions suggested.

Reviewer 2 - Varun Mohan

Rating: 8/10 (Strong Accept)

Strengths:

  • Achieved SOTA on several test datasets
  • Novel method to activate more pixels
  • Well-written paper and good use of figures
  • Excellent citation of previous work

Weaknesses:

  • The HAT-L variant has considerably higher model complexity
  • Requires substantial resources to pre-train and fine-tune the model

Overall: Establishes a new SOTA in the image super-resolution field but does not improve on model complexity.

Code Implementation and Experiments

Experiment Idea

We ran inference with 6 of the latest super-resolution deep learning models (2 pre-trained variants each of SwinIR, HAT, and SwinFIR). We then computed the PSNR and SSIM of the generated images and assigned them an opinion score. All experiments produced 4x-upscaled images. We used the code from the official GitHub repos provided by the respective authors (3 repos linked in the references).
We used 4 datasets: Set5, Set14, PIRM, and Urban100. In total, 59 images (5 from Set5, 14 from Set14, the first 20 from PIRM, and the first 20 from Urban100) were used in each experiment.
PSNR, SSIM, and opinion scores were used to evaluate each super-resolved image. PSNR and SSIM were computed on the Y (luminance) channel. Opinion scores range from 1 to 10 and were assigned by averaging the ratings provided by both of us. The score represents the degree to which the generated image resembles the ground truth to a human eye: a score of 1 means it completely resembles the bicubic-interpolated image, and 10 means it completely resembles the ground-truth image.
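As a reproducibility aid, here is a minimal sketch of how PSNR and SSIM on the Y channel can be computed with scikit-image; the border-crop convention and function name are our assumptions rather than the exact evaluation script we ran.

```python
import numpy as np
from skimage.color import rgb2ycbcr
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(sr_rgb: np.ndarray, gt_rgb: np.ndarray, scale: int = 4):
    """PSNR/SSIM on the luminance (Y) channel, the usual SR benchmark convention.

    sr_rgb, gt_rgb: uint8 RGB images of identical shape.
    scale:          a border of `scale` pixels is cropped, as is common in SR papers.
    """
    sr_y = rgb2ycbcr(sr_rgb)[..., 0]   # Y channel, values in [16, 235]
    gt_y = rgb2ycbcr(gt_rgb)[..., 0]
    # Crop borders so boundary artifacts do not dominate the score
    sr_y = sr_y[scale:-scale, scale:-scale]
    gt_y = gt_y[scale:-scale, scale:-scale]
    psnr = peak_signal_noise_ratio(gt_y, sr_y, data_range=255)
    ssim = structural_similarity(gt_y, sr_y, data_range=255)
    return psnr, ssim
```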
Results are reported in Tables I, II, III, and V.

Novel Contribution

  • Performed tests on 6 of the latest SR models, and reported results on the PIRM dataset, which was not previously included in any of the original papers
  • Added an opinion-score metric for model evaluation
  • Conducted experiments that hint at HATFIR being the most efficient single-image super-resolution model as of 2023

Results

Table: performance comparison of state-of-the-art methods on PSNR, SSIM, and opinion score. HAT-L performs best, and HATFIR has comparable performance.

Below are some interesting result samples

Although all the outputs look similar at first glance, on closer inspection one can observe that HATFIR and HAT-L distinguish between the wires better than the other models. SwinFIR makes for a close second.

HAT-L performs best on this image. It captures the 2 thin wires running parallel and forming a cross pattern on the glass, whereas the other models fail to discriminate between the 2 wires and render them as a single thick wire.

The bottom-right part of the images generated by SwinIR and SwinIR Light appears significantly blurred; the other models capture the cube pattern better. Surprisingly, HATFIR performs better than HAT-L here, eradicating the blur almost completely.

None of the models were able to super-resolve the number 85. Interestingly, they also failed to capture the fine cross-shaped material texture found in the ground-truth image.

HAT-L and HATFIR capture the dashed-line pattern on the left side of the image, whereas the other models struggle to get it right, either blurring the dashes or merging them into one.

All models perform similarly on this image. However, looking closely at the zebra's front-left leg around the black circular patch, we notice that all models except SwinIR and SwinIR Light hallucinate an incorrect pattern of upward-facing stripes.

Conclusion

This survey aimed to compare the best available super-resolution techniques by testing them on a fixed set of datasets (Set5, Set14, PIRM, Urban100) and evaluating them not just on PSNR and SSIM but also on a mean opinion score (MOS). After conducting the experiments, we made the following observations:

  • HAT-L and HATFIR are the best-performing models. HATFIR came very close to HAT-L's performance, especially on the Set5, Set14, and PIRM20 datasets
  • HAT and HAT-L had inference times significantly greater than the rest on CPU
  • SwinIR Light < SwinIR < HAT < SwinFIR < HATFIR < HAT-L is the general pattern that emerged with respect to PSNR, SSIM, and opinion scores. However, at a quick glance it is difficult for the human eye to perceive a difference between the model outputs, especially those of HAT, SwinFIR, HATFIR, and HAT-L, making them all good choices for super-resolution.
  • Accounting for model parameter count, it is interesting to see HATFIR match up to HAT-L. This arguably positions HATFIR as the best super-resolution technique available.

References

[1] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong. "Activating More Pixels in Image Super-Resolution Transformer", CVPR 2023

[2] Alex Shang, Yabin Zheng, Mary Bennion, and Alex Avramenko. Super Resolution on Arm NPU

[3] Daniel Glasner, Shai Bagon, Michal Irani. "Super-Resolution from a Single Image", ICCV, 2009

[4] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. "Learning a Deep Convolutional Network for Image Super-Resolution", ECCV, 2014

[5] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, Radu Timofte. "SwinIR: Image Restoration Using Swin Transformer", ICCV 2021

[6] Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobing Wang, Zhezhu Jin. "SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution", 2022

[7] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. "Image Super-Resolution Using Very Deep Residual Channel Attention Networks", In Proceedings of the European Conference on Computer Vision (ECCV), pages 286-301, 2018.

[8] Jinjin Gu and Chao Dong. "Interpreting super-resolution networks with local attribution maps", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9199-9208, 2021.

[9] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong. GitHub Repo - HAT

[10] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, Radu Timofte. GitHub Repo - SwinIR

[11] Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobing Wang, Zhezhu Jin. GitHub Repo - SwinFIR

[12] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi. Set5 - "Low-Complexity Single-Image Super-Resolution Based on Nonnegative Neighbor Embedding", BMVC 2012.

[13] Zeyde, R., Elad, M., Protter, M. (2012) Set14 - "On Single Image Scale-Up Using Sparse-Representations". In: Boissonnat, J.-D., et al. Curves and Surfaces 2010. Lecture Notes in Computer Science, vol 6920. Springer, Berlin, Heidelberg.

[14] Blau, Y., Mechrez, R., Timofte, R., Michaeli, T., Zelnik-Manor, L. PIRM - "The 2018 PIRM Challenge on Perceptual Image Super-Resolution", ECCVW, 2018.

[15] Jia-Bin Huang, Abhishek Singh, Narendra Ahuja. Urban100 - "Single Image Super-Resolution From Transformed Self-Exemplars", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5197-5206.

Team Members

Varun Mohan (mohan.va@northeastern.edu) and Wenyu Zhang (zhang.wenyu1@northeastern.edu)