
Deep Learning for Satellite Image Super-Resolution: Mainstream Methods in the State of the Art

Published on June 30, 2025

In a previous post, we provided a general overview of super-resolution methods through the specific lens of their application to satellite imaging. Our article A Comprehensive Introduction to Super Resolution for Satellite Images includes a quick review of the theoretical basis for super resolution.

This is the first of two posts in which we'll expand specifically on deep learning methods for super-resolution.

For this purpose, we will focus on providing examples of different types of super-resolution methods, including some published papers authored in collaboration with members of our team ([1], [2]). We will also discuss the caveats of the data needed for satellite image super-resolution and the datasets available, while providing real examples of super-resolved images produced with some of the presented models.

What is super-resolution and why is it important? 

Super-resolution technology plays a crucial role when it comes to satellite imagery. It is an advanced image processing technique used to enhance image resolution, revealing clearer and more detailed results and effectively reconstructing finer details that are missing in the original, lower-resolution image. This is particularly valuable for satellite imagery sources, which are inherently limited in resolution. Satellites play a vital role in numerous applications, including environmental monitoring, urban planning, defense, and disaster response, but the data they provide can be restricted by resolution. Super-resolution addresses this issue, enhancing the quality and usefulness of satellite imagery and making it even more valuable for a wide range of applications. The following image shows some examples of super-resolved images.

Illustration of original lower-resolution satellite images (top) vs super-resolved outputs (bottom) [1].

Single-image vs multi-image and multi-date super resolution

As their name indicates, super resolution methods can either use a single image (single-image) or multiple images of the same scene (multi-image) as input.

Single-image methods use algorithms that analyze the patterns, textures, and structures in the image to predict and generate higher-resolution versions.

Multi-image methods are useful when there are slight variations between the images (e.g., due to motion or capture angles), which can provide more data samples to reconstruct a better, single higher-resolution image. An important advantage of these methods is that they can recover actual high-resolution details from the scene without hallucinating content.

Multi-image methods exploit the fact that multiple images of a single scene are captured with slight variations in the position of the sensor. Placing these samples on a virtual higher-resolution grid that aggregates the multiple images enables the reconstruction of a higher-resolution image.
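
To make the grid idea concrete, here is a minimal shift-and-add sketch in Python (a classical baseline, not a deep learning method), assuming the sub-pixel shift of each frame is already known; the shift convention and nearest-neighbor placement are our simplifications, and in a real pipeline the remaining holes in the grid would be interpolated:

```python
import numpy as np

def shift_and_add(frames, shifts, scale=2):
    """Naive multi-image fusion: place each low-res sample on a
    `scale`-times finer grid according to its sub-pixel shift,
    then average overlapping samples.

    frames : list of (H, W) arrays, low-resolution observations
    shifts : list of (dy, dx) sub-pixel shifts in low-res pixels
    """
    h, w = frames[0].shape
    acc = np.zeros((h * scale, w * scale))
    weight = np.zeros_like(acc)
    ys, xs = np.mgrid[0:h, 0:w]  # low-res sample coordinates
    for frame, (dy, dx) in zip(frames, shifts):
        # Map each low-res sample to the nearest high-res grid cell.
        hy = np.clip(np.round((ys + dy) * scale).astype(int), 0, h * scale - 1)
        hx = np.clip(np.round((xs + dx) * scale).astype(int), 0, w * scale - 1)
        np.add.at(acc, (hy, hx), frame)
        np.add.at(weight, (hy, hx), 1.0)
    return acc / np.maximum(weight, 1e-8)  # cells never hit stay at 0
```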

Sub-pixel-accurate image co-registration is usually a necessary component of classical multi-image methods, used to place samples on a single higher-resolution grid, although in most deep learning methods this co-registration is implicit.
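
For classical pipelines, the shifts themselves can be estimated with phase correlation; below is a small example using scikit-image's phase_cross_correlation, where `upsample_factor` controls the sub-pixel precision:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift
from skimage.registration import phase_cross_correlation

rng = np.random.default_rng(0)
reference = rng.random((64, 64))
moving = nd_shift(reference, (0.4, -1.3))  # simulate a sub-pixel displacement

# upsample_factor=20 resolves the shift to 1/20 of a pixel.
estimated, error, phase_diff = phase_cross_correlation(
    reference, moving, upsample_factor=20
)
print(estimated)  # magnitude ~ (0.4, 1.3); sign follows scikit-image's convention
```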

A common pitfall of multi-image methods is that they can produce artifacts in cases where objects are occluded or move in a manner that is not consistent with the global motion between scenes (or local scene regions).

Multi-date methods are a type of multi-image method where the input images are captured days instead of milliseconds apart. They are more challenging because more changes occur in the scene between acquisitions.

Supervised vs self-supervised methods

In general, deep learning models need some kind of supervision for training. This means that the output of a deep neural network is compared to some reference data (a.k.a. ground truth) that helps improve the model's accuracy during the training process. In the case of deep learning super-resolution, this supervision comes in two main flavours: fully supervised or self-supervised.

  • Fully supervised methods: these use low/high-resolution image pairs to train the network. The network is fed one or more low-resolution images, and its output is compared to the high-resolution reference images during training.
  • Self-supervised methods: these use only low-resolution data to supervise the training of a restoration network. They are trained to predict a high-resolution version of the image that is consistent with a degradation model [3] or with other (low-resolution) observations of the same scene [4].

In simulation-based self-supervision methods, only high-resolution images are needed: an image formation model is used to generate downscaled versions of the images, and the deep learning model is then fed these downscaled versions and trained to predict the high-resolution counterparts. The fidelity of the image formation model to the real data determines the quality of the restoration results, as any mismatch introduces a bias.
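
As a rough illustration, here is a PyTorch-style training step, with average pooling plus Gaussian noise as a deliberately crude stand-in for a real image formation model (which would include the sensor PSF, motion, and compression, among other things):

```python
import torch
import torch.nn.functional as F

def degrade(hr, scale=2, noise_std=0.01):
    """Toy image formation model: box blur + downsample (average pooling),
    then additive sensor noise."""
    lr = F.avg_pool2d(hr, kernel_size=scale)
    return lr + noise_std * torch.randn_like(lr)

def training_step(model, hr_batch, optimizer, scale=2):
    lr_batch = degrade(hr_batch, scale)   # simulate the low-res input
    pred = model(lr_batch)                # predicted high-res image
    loss = F.l1_loss(pred, hr_batch)      # supervise with the real HR image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```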

Data collection is a major challenge in any deep learning project, and for satellite images this challenge is even harder. To improve the resolution of a particular satellite in a fully supervised manner, higher-resolution ground truth data is necessary. This is typically data from other, higher-resolution satellites or from aerial images, taken under different conditions than the images to super-resolve. Moreover, when high- and low-resolution images come from different sources, the differences in sensor characteristics (other than spatial resolution) need to be accounted for to avoid unwanted inductive biases: for example, one satellite's image could be mistakenly mapped to the different spectral profile of the higher-resolution satellite.

Self-supervised methods have a clear advantage over supervised ones in terms of data requirements, but they can be very sensitive to the image formation model (in the case of simulation-based methods), or only be applicable to particular cases, like multi-image models with burst-mode image sources. Furthermore, image formation models for satellite images can be very impractical or even impossible to formulate correctly due to the complexity of the many elements involved.

Choosing the right approach depends on the availability of data and the specific application needs.

Regression vs Generative Deep Learning Super Resolution

Deep learning methods can be divided into two main categories:

  1. Methods based on regression: they learn to produce a weighted average of all possible outputs.
  2. Methods based on conditional generative models, like Generative Adversarial Networks (GANs), Normalizing Flows (e.g. SRFlow), or (Latent) Diffusion Models.

We illustrate these categories with some examples in this section.

Regression-based Methods

SRCNN was the first example of a regressor applied to super-resolution. This network has a simple architecture with 3 main layers: the first extracts features from the low-resolution input image, the second performs a non-linear mapping of the features to a high-resolution feature space, and finally, the high-resolution output is reconstructed from the high-resolution features (see figure below). Due to its simplicity and effectiveness, SRCNN became the foundation for more advanced deep learning-based super-resolution methods.
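
For reference, the whole network fits in a few lines of PyTorch; this is a minimal sketch following the commonly used 9-1-5 kernel setting with 64 and 32 feature maps (the padding choices here are ours):

```python
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN. The input is the low-resolution image already
    upscaled with bicubic interpolation to the target size."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        return self.net(x)
```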

The same authors later proposed FSRCNN (Fast Super-Resolution CNN) [5] to enhance the speed of the previous model and achieve real-time performance (24 fps). FSRCNN improves on SRCNN by adding more steps to the super-resolution model while producing a more efficient pipeline. The main steps can be seen in the image below. Two key differences account for the efficiency of this new design: FSRCNN works directly on the low-resolution image using convolutions with smaller filters, and it applies a shrinking step with 1x1 convolutions to decrease the number of channels of the activation maps. By enhancing the original SRCNN architecture, FSRCNN delivers both improved quality and speed for super-resolution. The authors claim up to a 40x speed-up over the original architecture while achieving improved PSNR results over similar methods.

SRCNN vs FSRCNN processing steps [5].
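
And a corresponding FSRCNN-style sketch, following the d=56, s=12, m=4 configuration reported in the paper (again, padding values are our choice):

```python
import torch.nn as nn

class FSRCNN(nn.Module):
    """FSRCNN-style pipeline: operate directly on the low-resolution image,
    shrink channels with 1x1 convolutions, and upscale only at the end
    with a deconvolution."""
    def __init__(self, scale=2, channels=1, d=56, s=12, m=4):
        super().__init__()
        layers = [nn.Conv2d(channels, d, 5, padding=2), nn.PReLU(d)]  # feature extraction
        layers += [nn.Conv2d(d, s, 1), nn.PReLU(s)]                   # shrinking (1x1)
        for _ in range(m):                                            # non-linear mapping
            layers += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s)]
        layers += [nn.Conv2d(s, d, 1), nn.PReLU(d)]                   # expanding (1x1)
        self.body = nn.Sequential(*layers)
        self.upscale = nn.ConvTranspose2d(                            # deconvolution
            d, channels, 9, stride=scale, padding=4, output_padding=scale - 1
        )

    def forward(self, x):
        return self.upscale(self.body(x))
```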

Generative Methods

One of the main motivations to employ generative methods is that they can provide sharper and more natural-looking images than MSE-minimizing CNNs. This is because the latter produce reconstructions that represent a pixel-wise average of the many possible high-resolution solutions for the low-resolution input, and thus blur these many possible solutions together. In contrast, generative methods sample a single high-resolution example from the space of possible solutions.
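
A toy one-dimensional example makes this averaging effect concrete:

```python
import numpy as np

# Two equally plausible sharp "solutions" for the same low-res input:
# an edge one pixel to the left, or one pixel to the right.
xa = np.array([0., 0., 1., 1., 1.])
xb = np.array([0., 0., 0., 1., 1.])

# The MSE-minimizing prediction is their pixel-wise average...
x_mse = 0.5 * (xa + xb)
print(x_mse)  # [0. 0. 0.5 1. 1.] -> the edge is blurred
# ...whereas a generative model would output xa or xb, keeping the edge sharp.
```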

An important drawback of generative models, particularly when super-resolving above a 2x factor, is the possible introduction of hallucinated content: details that may look plausible but were never in the original scene. This is a result of the underlying physics of image formation: the information about details beyond that 2x factor never made it to the low-resolution image, so any detail finer than that is introduced by the network based on its training data. You can find a reminder of image formation physics in our introductory post on super-resolution.

Generative models can push the reconstruction resolution well beyond the limits imposed by sampling theory, with some vendors offering 10x resolution increases. In these extreme cases, it must be understood that what the models do is mostly invent (hallucinate) content or in-paint the reconstruction with archived content. For example, you may find in the reconstruction a partially-built road that has now been completed, but it’s shown in the same state as in some older aerial survey images used to guide the 10x enhancement.

Generative Adversarial Networks (GANs)

The first example of super-resolution with GANs is SRGAN [6]. Super-resolution GANs are trained to produce a high-resolution image that can fool a discriminator network, which decides whether an image looks realistic enough given the training images.
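
The following minimal sketch shows the adversarial part of such a training loop; the full SRGAN objective also includes a VGG-based content loss, omitted here for brevity (`G` and `D` are stand-ins for the generator and discriminator networks):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, lr_imgs, hr_imgs, opt_g, opt_d):
    """One adversarial training step; G maps low-res to high-res images,
    D outputs a realism logit per image."""
    # --- Discriminator: distinguish real HR images from generated ones.
    fake = G(lr_imgs).detach()
    real_logits, fake_logits = D(hr_imgs), D(fake)
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator: produce images the discriminator scores as real.
    gen_logits = D(G(lr_imgs))
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```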

Architecture of SRGAN with kernel size (k), number of feature maps (n) and stride (s) indicated for each convolutional layer [6].

Another important work in deep learning super-resolution is ESRGAN [7] (Enhanced SRGAN), which modifies the original SRGAN to generate enhanced super-resolution images, resulting in improved visual quality compared to previous methods. These modifications include a new generator architecture based on Residual-in-Residual Dense Blocks (RRDB), a Relativistic GAN as the discriminator, and a more effective perceptual loss. ESRGAN is still used to this day as an effective method for super-resolution and has also been applied to satellite images.

GANs are notoriously hard to train and can suffer from mode collapse (the network converges to a very limited set of results). More recently, normalizing flows (e.g. SRFlow [8]) and diffusion-based methods (e.g. SR3 [9]) have been proposed to address these issues, while also providing higher-quality outputs.

Normalizing Flows

Normalizing flows are a family of deep learning methods that can map intractable distributions, like those of latent features from neural networks, to parametric probability distributions such as the normal (Gaussian) distribution. This is useful for learning interpretable probability distributions over complex data.

SRFlow [8] is a super-resolution method based on normalizing flows that addresses the limitations of GANs and CNNs by learning the conditional distribution of plausible high-resolution images given a low-resolution image.

To build the normalizing flow, SRFlow uses a two-part approach: a multi-level image encoder (gθ) and an invertible flow network (fθ) that normalizes the image features. The image encoder gθ is a CNN based on the RRDB architecture proposed in the ESRGAN paper, and the flow network at each level is based on the GLOW normalizing flow. One of the benefits of SRFlow is that it can be trained using only the negative log-likelihood as its loss function.

SRFlow architecture showing gθ, the low-resolution image encoder, and fθ, the invertible flow network.
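
Conceptually, that training objective is the change-of-variables negative log-likelihood. Here is a hedged sketch, with `encoder` and `flow` as stand-ins for SRFlow's gθ and fθ:

```python
import math
import torch

def flow_nll(flow, encoder, hr, lr):
    """Negative log-likelihood of a conditional normalizing flow: the only
    loss SRFlow needs. `flow` must return the latent z and the log-determinant
    of its Jacobian (the change-of-variables correction)."""
    cond = encoder(lr)          # g: conditioning features from the LR image
    z, logdet = flow(hr, cond)  # f: invertible map of the HR image to z
    log_pz = (
        -0.5 * (z ** 2).sum(dim=(1, 2, 3))
        - 0.5 * z[0].numel() * math.log(2 * math.pi)
    )                           # log N(z; 0, I), per batch element
    return -(log_pz + logdet).mean()
```

Because the flow is invertible, sampling a super-resolved image amounts to drawing z from the Gaussian (usually with a temperature below 1) and running the flow in reverse with the same low-resolution conditioning.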

At the moment of publication, SRFlow outperformed GAN-based methods on PSNR and perceptual quality metrics.

Flow-based methods, and SRFlow in particular, map the space of high-resolution images into a parametric distribution, and this mapping is invertible. This means that:

  1. The model generates not just one image conditioned on the low-resolution input, but a whole distribution from which images can be sampled.
  2. It provides a framework for image manipulation: you can follow the reverse flow process and encode a high-resolution image into the latent parametric space. SRFlow provides examples of style transfer, content transfer, and image restoration performed with the trained network using this process.

Some examples of normalizing flows applied to remote sensing can be found in [22] and [23].

Diffusion Models

Diffusion models are a family of generative models that can synthesize images through iterative denoising steps inspired by the physical process of diffusion. Although a diffusion-inspired deep learning model was first proposed in [10], image generation with diffusion models gained popularity after the publication of Denoising Diffusion Probabilistic Models (DDPM) [11].

SR3 (Super-Resolution via Repeated Refinement) [9] is a diffusion model inspired by DDPM that generates high-resolution, super-resolved images via a diffusion process conditioned on low-resolution input images. Diffusion models are usually composed of two processes: a forward process, where noise is iteratively added to an image until it becomes white noise, and a reverse process, where white noise is iteratively denoised until a realistic image emerges. In SR3, this reverse process is conditioned on the low-resolution image in order to build its high-resolution counterpart.

Representation of the forward and reverse (right to left) diffusion processes from SR3 [9]. The reconstruction process is conditioned on an input x.
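
A minimal sketch of both processes in the SR3 setting, where conditioning is done by concatenating the (bicubic-upsampled) low-resolution image to the noisy input of the denoiser (`denoiser` is a placeholder for the U-Net used in practice):

```python
import torch
import torch.nn.functional as F

def q_sample(x0, t, alphas_cumprod):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

def sr3_step(denoiser, hr, lr_up, alphas_cumprod, optimizer):
    """Train the reverse process: the denoiser predicts the noise from the
    noisy HR image concatenated (channel-wise) with the upsampled LR image."""
    t = torch.randint(0, len(alphas_cumprod), (hr.shape[0],), device=hr.device)
    xt, noise = q_sample(hr, t, alphas_cumprod)
    pred = denoiser(torch.cat([xt, lr_up], dim=1), t)  # conditioning on lr_up
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```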

SR3 reports better performance than other methods such as GANs, particularly on perceptual metrics and perceptual challenges, like fooling humans into choosing generated images as more realistic than the ground truth. SR3 was originally designed and trained for natural images, but it has been used successfully with satellite images; it is even available within the arcgis.learn module of the ArcGIS Python API.

Accelerated Diffusion Models  

Traditional diffusion models typically require ~1000 steps and evaluate a large score network at each step, leading to slow inference times. More recently, this computational cost has been significantly reduced by essentially two techniques:

  • Latent Diffusion Models [12] (e.g. Stable Diffusion) reduce costs by running the diffusion process in a much lower-dimensional latent space. As an added bonus, these methods can optionally be conditioned on a text prompt (see the example after this list).
  • Distilled diffusion models [13] and consistency models [14] start from a (possibly latent) diffusion model of 1000 steps as a teacher to train a student model that can run in a much smaller number of steps.
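
As a concrete illustration of the latent approach, latent diffusion super-resolution can be tried off the shelf with Hugging Face's diffusers library and the Stable Diffusion 4x upscaler; note this is a general-purpose model, not one tuned for satellite imagery, and the file names here are hypothetical:

```python
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

# Latent-diffusion 4x super-resolution: the diffusion runs in the
# autoencoder's latent space and can be steered with a text prompt.
pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("scene_lowres.png").convert("RGB")  # hypothetical input file
upscaled = pipe(prompt="an aerial photograph of a city", image=low_res).images[0]
upscaled.save("scene_x4.png")
```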

Notably, DMD2 [15] combines both approaches, generating images of SOTA quality with only 8, 4, 2, or even 1 iteration of the diffusion process in the latent space.

As in the previous sections, distilled and/or latent diffusion models have been repurposed to solve inverse problems like super-resolution, in two different ways:

  1. Training-free guidance of the generative process slightly modifies the score network, without retraining, so that it solves the desired inverse problem. Notable examples are DPS [16], DiffPIR [17], and their equivalents in the latent space, like Latent DPS [12], [18].
  2. Fine-tuning of the score network with paired low-res/high-res images, either using a ControlNet for distilled diffusion models, as in CoSIGN [19], or a specific architecture for latent diffusion models, as in SILO [20].

The first approach to leverage the computational advantages of both latent and distilled models in the context of inverse problems via training-free guidance is LATINO-PRO [21]. This recent model can solve SR tasks (2x up to 32x upscaling factors) at a target resolution of 1024x1024 with only 8 NFEs (network function evaluations), i.e. around 5 seconds on an A100 GPU, and obtains SoTA performance in terms of FID, PSNR, and LPIPS.

LATINO-PRO [21] iteratively improves the degraded image by i) encoding, ii) diffusion in the latent space, iii) decoding, and iv) training-free guidance (for consistency with the low-res image). This process is iterated 4 times.
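
A hedged sketch of such a loop, with placeholder components rather than the authors' implementation (`encode`/`decode` stand in for the latent autoencoder, `denoise_latent` for a few distilled diffusion steps, and `A` for the known degradation mapping high-res images to the low-res observation `y`):

```python
import torch
import torch.nn.functional as F

def latino_style_sr(y, encode, denoise_latent, decode, A, scale=4, iters=4, step=0.1):
    """Illustrative loop only, NOT the authors' code: alternate latent
    diffusion denoising with a training-free data-consistency update."""
    x = F.interpolate(y, scale_factor=scale, mode="bicubic")  # crude initialization
    for _ in range(iters):
        z = encode(x)                               # i) to latent space
        z = denoise_latent(z)                       # ii) diffusion in latent space
        x = decode(z)                               # iii) back to pixel space
        x = x.detach().requires_grad_(True)         # iv) training-free guidance:
        data_fit = ((A(x) - y) ** 2).sum()          #     gradient step toward
        (grad,) = torch.autograd.grad(data_fit, x)  #     consistency with y
        x = (x - step * grad).detach()
    return x
```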

Conclusions

As satellite data becomes central to decision-making across sectors, SR offers a way to extract more insight, precision, and value from existing assets—without the need for costly re-tasking or higher-resolution satellites.

In this first part of our two-part post, we've explored the foundational building blocks of deep learning-based super-resolution—from early regression models like SRCNN, to powerful generative methods including GANs, normalizing flows, and diffusion models.

Each category brings unique strengths: regression-based models offer speed and stability, while generative approaches produce sharper results, albeit sometimes at the cost of introducing hallucinated content. We also highlighted a key axis of differentiation between supervised and self-supervised training methods, emphasizing the data challenges specific to remote sensing and satellite imagery. In particular, the difficulty of collecting well-aligned high-resolution ground truth data makes self-supervised and simulation-based approaches an attractive—though technically demanding—alternative.

The strategic opportunity is clear: organizations that invest in smart, mission-aligned SR now will gain sharper insight and faster time-to-decision across their geospatial workflows.

In the next post, we will dive deeper into some methods specifically tailored for satellite images as well as sharing practical examples. Stay tuned.

Learn more

📍 Our Services:
See how we help industrial and space clients at www.digitalsense.ai

📞 Schedule a call:
For decision-makers looking to optimize operations using satellite data and AI, Digital Sense offers full-cycle consulting, from prototype to production-grade deployment. Contact us

Research Driven, Results Focused. That’s Digital Sense.

References 

[1] N. L. Nguyen, J. Anger, A. Davy, P. Arias, and G. Facciolo, “L1BSR: Exploiting Detector Overlap for Self-Supervised Single-Image Super-Resolution of Sentinel-2 L1B Imagery,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2013–2023. Accessed: Feb. 28, 2025. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023W/EarthVision/html/Nguyen_L1BSR_Exploiting_Detector_Overlap_for_Self-Supervised_Single-Image_Super-Resolution_of_Sentinel-2_CVPRW_2023_paper.html

[2] J. Lafenetre, N. L. Nguyen, G. Facciolo, and T. Eboli, “Handheld Burst Super-Resolution Meets Multi-Exposure Satellite Imagery,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2056–2064. Accessed: Feb. 28, 2025. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023W/EarthVision/html/Lafenetre_Handheld_Burst_Super-Resolution_Meets_Multi-Exposure_Satellite_Imagery_CVPRW_2023_paper.html

[3] D. Chen, J. Tachella, and M. E. Davies, “Equivariant Imaging: Learning Beyond the Range Space,” Aug. 23, 2021, arXiv: arXiv:2103.14756. doi: 10.48550/arXiv.2103.14756.

[4] N. L. Nguyen, J. Anger, A. Davy, P. Arias, and G. Facciolo, “Self-Supervised Super-Resolution for Multi-Exposure Push-Frame Satellites,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1858–1868. Accessed: Mar. 31, 2025. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2022/html/Nguyen_Self-Supervised_Super-Resolution_for_Multi-Exposure_Push-Frame_Satellites_CVPR_2022_paper.html

[5] C. Dong, C. C. Loy, and X. Tang, “Accelerating the Super-Resolution Convolutional Neural Network,” Aug. 01, 2016, arXiv: arXiv:1608.00367. doi: 10.48550/arXiv.1608.00367.

[6] C. Ledig et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” May 25, 2017, arXiv: arXiv:1609.04802. doi: 10.48550/arXiv.1609.04802.

[7] X. Wang et al., “ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks,” presented at the Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0. Accessed: Mar. 21, 2025. [Online]. Available: https://openaccess.thecvf.com/content_eccv_2018_workshops/w25/html/Wang_ESRGAN_Enhanced_Super-Resolution_Generative_Adversarial_Networks_ECCVW_2018_paper.html

[8] A. Lugmayr, M. Danelljan, L. V. Gool, and R. Timofte, “SRFlow: Learning the Super-Resolution Space with Normalizing Flow,” Jul. 31, 2020, arXiv: arXiv:2006.14200. doi: 10.48550/arXiv.2006.14200.

[9] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image Super-Resolution via Iterative Refinement,” Jun. 30, 2021, arXiv: arXiv:2104.07636. doi: 10.48550/arXiv.2104.07636.

[10] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” in Proceedings of the 32nd International Conference on Machine Learning, PMLR, Jun. 2015, pp. 2256–2265. Accessed: Mar. 24, 2025. [Online]. Available: https://proceedings.mlr.press/v37/sohl-dickstein15.html

[11] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2020, pp. 6840–6851. Accessed: Mar. 24, 2025. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html

[12] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis With Latent Diffusion Models,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695. Accessed: Mar. 24, 2025. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2022/html/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper

[13] T. Salimans and J. Ho, “Progressive Distillation for Fast Sampling of Diffusion Models,” Jun. 07, 2022, arXiv: arXiv:2202.00512. doi: 10.48550/arXiv.2202.00512.

[14] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency Models,” May 31, 2023, arXiv: arXiv:2303.01469. doi: 10.48550/arXiv.2303.01469.

[15] T. Yin et al., “Improved Distribution Matching Distillation for Fast Image Synthesis,” Advances in Neural Information Processing Systems, vol. 37, pp. 47455–47487, Dec. 2024.

[16] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, “Diffusion Posterior Sampling for General Noisy Inverse Problems,” May 20, 2024, arXiv: arXiv:2209.14687. doi: 10.48550/arXiv.2209.14687.

[17] Y. Zhu et al., “Denoising Diffusion Models for Plug-and-Play Image Restoration,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1219–1229. Accessed: Mar. 24, 2025. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023W/NTIRE/html/Zhu_Denoising_Diffusion_Models_for_Plug-and-Play_Image_Restoration_CVPRW_2023_paper.html

[18] L. Rout, N. Raoof, G. Daras, C. Caramanis, A. Dimakis, and S. Shakkottai, “Solving Linear Inverse Problems Provably via Posterior Sampling with Latent Diffusion Models,” Advances in Neural Information Processing Systems, vol. 36, pp. 49960–49990, Dec. 2023.

[19] J. Zhao, B. Song, and L. Shen, “CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems,” in Computer Vision – ECCV 2024, A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, Eds., Cham: Springer Nature Switzerland, 2025, pp. 108–126. doi: 10.1007/978-3-031-73195-2_7.

[20] R. Raphaeli, S. Man, and M. Elad, “SILO: Solving Inverse Problems with Latent Operators,” Jan. 20, 2025, arXiv: arXiv:2501.11746. doi: 10.48550/arXiv.2501.11746.

[21] A. Spagnoletti, J. Prost, A. Almansa, N. Papadakis, and M. Pereyra, “LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization,” Mar. 16, 2025, arXiv: arXiv:2503.12615. doi: 10.48550/arXiv.2503.12615.