Single image super-resolution is a classic and challenging problem in computer vision. It has practical applications in various domains, including medical imaging, satellite imagery, and enhancing the quality of digital photographs.
We input a low-resolution image and ask the network to generate a higher-resolution image.
Figure: low-resolution image, traditionally interpolated image, and AI super-resolution image.
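As a reference point, the "traditionally interpolated" baseline above can be produced with simple bicubic upscaling. The minimal Python sketch below assumes bicubic interpolation and a 4x scale factor (both common choices, not stated in the figure); file names are illustrative.

```python
# Minimal sketch: 4x upscale of a low-resolution image with bicubic
# interpolation, the usual "traditional" baseline. Paths are illustrative.
from PIL import Image

lr = Image.open("lr_image.png")
bicubic = lr.resize((lr.width * 4, lr.height * 4), resample=Image.BICUBIC)
bicubic.save("bicubic_x4.png")
```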
CNN-based methods have long dominated the image super-resolution (SR) field, but Transformers have recently attracted attention, and SwinIR, a Transformer-based model, achieved a breakthrough. Despite this success, why Transformers outperform CNNs remains a mystery. An intuitive explanation is that Transformers benefit from the self-attention mechanism and can utilize long-range information. Interestingly, the authors find that SwinIR does NOT exploit more input pixels than CNN-based methods. The proposed Hybrid Attention Transformer (HAT) achieves higher pixel activation, as shown below.
Evolution of Image Super Resolution: The journey of image super-resolution (ISR) has evolved from basic image enhancement techniques to the use of advanced deep learning models. Particularly, Convolutional Neural Networks (CNNs) have played a pivotal role in addressing resolution limitations across various fields.
Transformers in Vision: Building on their success in natural language processing, Transformers have been adopted in computer vision. This transition brought significant improvements in capturing long-range dependencies and processing global information in ISR tasks. In 2021, SwinIR, a landmark Transformer-based network, was introduced and achieved breakthrough improvements in SR.
Introduction of Local Attribution Maps (LAM): LAM performs attribution analysis of SR networks, aiming to find the input pixels that strongly influence the SR results. The 2021 LAM paper concluded that SR networks that involve a wider range of input pixels can achieve better performance.
Comparison of the SR results and LAM attribution results of different SR networks. The LAM results visualize the importance of different pixels w.r.t. the SR results
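To make the attribution idea concrete, the sketch below shows a simplified, gradient-only version of it in PyTorch: pick a patch of the super-resolved output and ask which input pixels its value depends on. LAM itself integrates gradients along a progressive blurring path of the input rather than using a single plain gradient, and the tiny model here is only a stand-in for a real SR network (it does not even upsample).

```python
# Simplified sketch of attribution analysis for SR networks.
# LAM uses path-integrated gradients; here we only take plain input gradients
# with respect to one output patch, using a toy model as a stand-in.
import torch
import torch.nn as nn

# Toy stand-in for an SR network (any nn.Module mapping LR -> SR works here;
# a real SR network would also upsample).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))

lr = torch.rand(1, 3, 64, 64, requires_grad=True)   # low-resolution input
sr = model(lr)

# Attribute a small patch of the SR output back to the input pixels.
y0, x0, size = 30, 30, 8
target = sr[..., y0:y0 + size, x0:x0 + size].sum()
target.backward()

# Pixels with large gradient magnitude are the ones the network "used"
# to reconstruct the chosen patch -- the quantity LAM visualizes.
attribution = lr.grad.abs().sum(dim=1).squeeze(0)   # (64, 64) saliency map
print(attribution.shape)
```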
Development of HAT: This paper introduces the Hybrid Attention Transformer (HAT), a novel approach that combines channel attention and self-attention mechanisms. Guided by LAM analysis, the design aims to overcome previous limitations of Transformers in ISR, enabling better utilization of input information for enhanced image reconstruction.
Exploring the common link:
Since the HAT approach advances the state of the art in image super-resolution, it presents itself as a front-runner candidate for use in security and surveillance products. However, because deep learning models often hallucinate extra details when generating an image, care should be taken to mitigate model biases.
Similarly, in the field of medicine, the super-resolution model should be made as reliable as possible and used only where hallucination is acceptable.
HAT promises SOTA performance and has established itself as an industry-ready model.
This paper was supported by the Tencent ARC Lab, an active player in the super-resolution field. Tencent's use cases for the technology lie in its cloud and social media services: cloud services benefit from super-resolution because low-resolution images can be upscaled in real time, saving the storage space that high-resolution counterparts would otherwise occupy.
Hence, HAT provides companies like Tencent with an improved approach for implementing super-resolution in their products and services.
This paper introduces the idea of activating more pixels to get better image reconstructions. Further research could be done to take this idea forward and produce even better-looking image reconstructions.
A recent paper by D. Zhang et al., "SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution", aims to do exactly that. The authors propose SwinFIR and HATFIR by introducing Fast Fourier Convolution (FFC) components, which have an image-wide receptive field, further improving the efficiency of capturing global information.
Figure: (a) the Residual Hybrid Attention Group (RHAG) in HAT; (b) the Swin Fourier Transformer Block (SFTB) in HATFIR. The 3x3 convolution in HAT's RHAG is replaced with the SFB to obtain the SFTB.
Quantitative comparison with HAT
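To illustrate why an FFC-style block has an image-wide receptive field, here is a minimal Fourier-unit sketch in PyTorch: a 1x1 convolution applied to the frequency spectrum mixes information from every spatial position at once. This is an illustrative module written for this review, not the authors' SFB/SFTB implementation.

```python
# Minimal sketch of the Fourier-unit idea behind Fast Fourier Convolution:
# transform to the frequency domain, apply a pointwise convolution, transform
# back. Every output pixel then depends on the whole input image.
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis.
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")                  # complex, (b, c, h, w//2+1)
        spec = torch.cat([spec.real, spec.imag], dim=1)          # (b, 2c, h, w//2+1)
        spec = self.relu(self.conv(spec))                        # mix channels in frequency domain
        real, imag = spec.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        return torch.fft.irfft2(spec, s=(h, w), norm="ortho")    # back to (b, c, h, w)

x = torch.rand(1, 16, 32, 32)
print(FourierUnit(16)(x).shape)   # torch.Size([1, 16, 32, 32])
```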
Rate: 7/10 (would accept)
Strengths: Demonstrates significant improvement over existing methods, with extensive experimental validation.
Weaknesses: Future work could benefit from a discussion on model complexity and a broader comparative analysis.
Overall: Offers a substantial contribution to Transformer applications in super-resolution, meriting publication with minor revisions suggested.
Rate: 8/10 (Strong Accept)
Strengths: Establishes a new state of the art in the image super-resolution field.
Weaknesses: Does not improve on model complexity.
Overall: Sets a new SOTA in image super-resolution, but fails to reduce model complexity.
We ran inference with six recent super-resolution deep learning models (two pre-trained variants each of SwinIR, HAT, and SwinFIR), then computed the PSNR and SSIM of the generated images and assigned each an opinion score. All experiments produced 4x upscaled images. We used the code from the official GitHub repositories provided by the respective authors (the three repositories are linked in the references).
We used four datasets: Set5, Set14, PIRM, and Urban100. In total, 59 images (5 from Set5, 14 from Set14, the first 20 from PIRM, and the first 20 from Urban100) were used for each experiment.
PSNR, SSIM, and opinion scores were used to evaluate each super-resolved image. PSNR and SSIM were computed on the Y (luminance) channel. Opinion scores range from 1 to 10 and were assigned by averaging the ratings given by both of us; the score represents how closely the generated image resembles the ground-truth image to the human eye. A score of 1 means the output completely resembles the bicubic interpolated image, and 10 means it completely resembles the ground-truth image.
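A minimal sketch of this evaluation is shown below: PSNR and SSIM computed on the Y channel with scikit-image, plus the averaged opinion score. The file paths, the example ratings, and the BT.601 RGB-to-Y conversion are our assumptions about common SR evaluation practice, not the exact scripts from the model repositories.

```python
# Sketch of the per-image evaluation: PSNR and SSIM on the luminance (Y)
# channel of the super-resolved and ground-truth images, plus the opinion
# score averaged over the two raters. Paths and ratings are illustrative.
import numpy as np
from skimage.io import imread
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def rgb_to_y(img):
    """Convert an RGB uint8 image to the Y channel (ITU-R BT.601), range ~16-235."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

sr = rgb_to_y(imread("results/hat_l/img_001_x4.png"))      # hypothetical output path
gt = rgb_to_y(imread("datasets/Urban100/HR/img_001.png"))  # hypothetical ground-truth path

psnr = peak_signal_noise_ratio(gt, sr, data_range=255)
ssim = structural_similarity(gt, sr, data_range=255)

# Opinion score: average of the two raters' 1-10 scores for this image.
opinion = np.mean([7, 8])
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}  Opinion: {opinion:.1f}")
```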
Results are reported in Tables I, II, III, and V.
State-of-the-art methods performance comparison on PSNR, SSIM, and Opinion Score. HAT-L performs best and HATFIR has comparable performance.
Below are some interesting result samples.
Although all the outputs look similar at first glance, a closer look shows that HATFIR and HAT-L distinguish the wires better than the other models, with SwinFIR a close second.
HAT-L performs best on this image: it captures the two thin wires running parallel and forming a cross pattern on the glass, whereas the other models fail to discriminate between the two wires and render them as a single thick wire.
The bottom-right part of the images generated by SwinIR and SwinIR-Light appears significantly blurred, while the other models capture the cube pattern better. Surprisingly, HATFIR performs better than HAT-L here, removing the blur almost completely.
None of the models was able to super-resolve the number 85. Interestingly, they also failed to capture the fine cross-shaped material texture found in the ground-truth image.
HAT-L and HATFIR capture the dashed-line pattern on the left side of the image, whereas the other models struggle to get it right, either blurring the dashes or merging them into one.
All models perform similarly on this image. However, a closer look at the zebra's front-left leg, around the black circular patch, shows that all models except SwinIR and SwinIR-Light hallucinate an incorrect pattern of stripes (facing upwards).
The survey aimed to compare the best available super-resolution techniques by testing them on a fixed set of datasets (Set5, Set14, PIRM, Urban100) and evaluating them not just on PSNR and SSIM, but also on a mean opinion score (MOS). After conducting the experiments, we made the following observations:
[1] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong. "Activating More Pixels in Image Super-Resolution Transformer", CVPR 2023
[2] Alex Shang, Yabin Zheng, Mary Bennion, and Alex Avramenko. Super Resolution on Arm NPU
[3] Daniel Glasner, Shai Bagon, Michal Irani. "Super-Resolution from a Single Image", ICCV, 2009
[4] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. "Learning a Deep Convolutional Network for Image Super-Resolution", ECCV, 2014
[5] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, Radu Timofte. "SwinIR: Image Restoration Using Swin Transformer", ICCV 2021
[6] Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobing Wang, Zhezhu Jin. "SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution", 2022
[7] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. "Image Super-Resolution Using Very Deep Residual Channel Attention Networks", ECCV 2018, pp. 286-301.
[8] Jinjin Gu and Chao Dong. "Interpreting super-resolution networks with local attribution maps", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9199-9208, 2021.
[9] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong. GitHub Repo - HAT
[10] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, Radu Timofte. GitHub Repo - SwinIR
[11] Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobing Wang, Zhezhu Jin. GitHub Repo - SwinFIR
[12] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi. Set5 - "Low-Complexity Single-Image Super-Resolution Based on Nonnegative Neighbor Embedding", BMVC 2012.
[13] R. Zeyde, M. Elad, and M. Protter. Set14 - "On Single Image Scale-Up Using Sparse-Representations", Curves and Surfaces 2010, LNCS vol. 6920, Springer, 2012.
[14] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor. PIRM - "The 2018 PIRM Challenge on Perceptual Image Super-Resolution", ECCVW 2018.
[15] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Urban100 - "Single Image Super-Resolution From Transformed Self-Exemplars", CVPR 2015, pp. 5197-5206.
Varun Mohan (mohan.va@northeastern.edu) and Wenyu Zhang (zhang.wenyu1@northeastern.edu)