Anomaly Detection: UFlow

Written by
Matías Tailanian
Published on
June 3, 2024

Author: Matías Tailanian, Team Leader @ Digital Sense

This article presents our new method for Anomaly Detection in images. It focuses on defect detection in an industrial environment, but, as we show in this work, the method also exhibits excellent performance in very different fields, such as surveillance and medical imaging.

Here is the article, in case you want to directly read it:

Here is an online viewer:

The official code is

Also, it has been added to the awesome Anomalib library!:

Fig. 1: Anomalies detected with the proposed approach on MVTec-AD examples from different categories. Top row: original images with ground truth segmentation. Middle and bottom rows: corresponding anomaly maps and automatic segmentations.

What is Anomaly Detection in images?

  • The task of detecting images or regions of the images that are not "normal" (we will define later what "normal" means).
  • An unsupervised task. What is special about this task is that we usually try to characterize normality and detect, as an anomaly, everything that lies outside this characterization.

It is formulated as a one-class problem, where we only train using normal images. Note that anomalies are rare structures by definition, and therefore, it would be very difficult or even impossible to gather a good amount of data to train a supervised method.

This setting is perfect for an industrial use case where we want to detect defects in the production line, as we do not know what the anomalies may look like, and every anomaly could be different.

What is U-Flow?

U-Flow is a one-class self-supervised method for anomaly segmentation in images that benefits both from a modern machine learning approach and a more classic statistical detection theory. 

It makes use of Normalizing Flows, a type of Generative Model that maps the distribution of the normal data to a predefined known distribution, typically a Gaussian distribution.

Fig 2. The method consists of four phases. (1) Multi-scale feature extraction: a rich multi-scale representation is obtained with MS-CaiT by combining pre-trained image Transformers acting at different image scales. (2) U-shaped Normalizing Flow: by adapting the widely used U-like architecture to NFs, a fully invertible architecture is designed. This architecture is capable of merging the information from different scales while ensuring independence in both intra- and inter-scales. To make it fully invertible, split and invertible up-sampling operations are used. (3) Anomaly score map generation: an anomaly map is generated by associating a likelihood-based anomaly score to each pixel in the test image. (4) Anomaly segmentation: besides generating the anomaly map, we also propose to adapt the a contrario framework to obtain an automatic threshold by controlling the allowed number of false alarms.

Our method takes an image as input and generates two different outputs:

  1. An anomaly score map. Each pixel in the image is assigned an anomaly score. The greater the score, the more anomalous it is.
  2. A segmentation mask. Many real-world applications need an actual segmentation of the anomalies, and the score map alone might not be enough. Therefore, we also output a segmentation mask in which white pixels are detected as anomalous. To do so, we employ the a contrario methodology, as explained below.

More precisely, U-Flow consists of four phases. First, features are extracted using a multi-scale image Transformer architecture. Then, these features are fed into a U-shaped Normalizing Flow (NF) that lays the theoretical foundations for the subsequent phases. The third phase computes a pixel-level anomaly map from the NF embedding, and the last phase performs a segmentation based on the a contrario framework. This multiple-hypothesis testing strategy permits the derivation of robust unsupervised detection thresholds, which, as we mentioned, are crucial in real-world applications where an operational point is needed. 

The segmentation results are evaluated using the Mean Intersection over Union (mIoU) metric, and for assessing the generated anomaly maps we report the area under the Receiver Operating Characteristic curve (AUROC), as well as the Area Under the Per-Region-Overlap curve (AUPRO). Extensive experimentation in various datasets shows that the proposed approach produces state-of-the-art results for all metrics and all datasets, ranking first in most MVTec-AD categories, with a mean pixel-level AUROC of 98.74%.

Moreover, we also tested U-Flow in settings very different from industrial defect detection (medical imaging and a surveillance dataset), and it showed excellent performance there too!


Phase 1: Feature Extraction

Since anomalies can emerge in various sizes and forms, collecting image information at multiple scales is essential. To do so, the standard deep learning strategy is to use pre-trained CNNs, often a VGG [36] or any variant of the ResNet [37] architectures, to extract a rich multi-scale image feature representation by keeping the intermediate activation maps at different depths of the network. More recently, with the development of vision Transformers, some architectures such as ViT [38] and CaiT [21] are also being used, but in these cases, a single feature map is retrieved. The features generated by vision Transformers have been shown to compress multi-scale information into a single feature map volume better than standard CNNs do. However, these features can be further enhanced by obtaining a multi-scale feature hierarchy from the Transformers. In our work, we propose MS-CaiT, a multi-scale Transformer architecture that employs CaiT Transformers at different scales, independently pre-trained on ImageNet [14], and combines them as the encoder of a U-Net [41] architecture. Despite its simplicity, an ablation study presented in Section 5.2 of the paper shows that this combination strategy leads to better results.

Phase 2: Normalizing Flow

Normalizing Flows [42] are a family of generative models that are trained by directly maximizing the log-likelihood of the input data. Their distinguishing feature is that they learn a bijective mapping between the input distribution and the latent space. Because an NF is built as a series of invertible transformations, it can be run in both directions. The forward process embeds data into the latent space and can serve as a measure of the likelihood. The reverse process starts from a predefined distribution (usually a standard Normal distribution) and generates samples following the learned data distribution.

Picture taken from, an excellent blog for getting started with Normalizing Flows!

The rationale for using NFs in an anomaly detection setting is straightforward. The network is trained using only anomaly-free images, and in this way, it learns to transform the distribution of normal images into a white Gaussian noise process. At test time, when the network is fed with an anomalous image, it is expected that it will not generate a sample with a high likelihood, according to the white Gaussian noise model. Hence, a low likelihood indicates the presence of an anomaly.
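To make the rationale above concrete, here is a minimal sketch of the idea on one-dimensional toy data. It is not the U-Flow architecture: the "flow" is a single affine map z = (x - mu) / sigma, and the function names and toy values are assumptions chosen for illustration. It only shows how the change-of-variables formula turns a learned mapping to a standard Gaussian into a likelihood, and why a sample far from the training data gets a low likelihood.

```python
import math
import random

def fit_affine_flow(samples):
    """Fit z = (x - mu) / sigma by maximum likelihood on normal-only data."""
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / len(samples)
    return mu, math.sqrt(var)

def log_likelihood(x, mu, sigma):
    """Change of variables: log p(x) = log N(z; 0, 1) + log |dz/dx|."""
    z = (x - mu) / sigma
    log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    log_det = -math.log(sigma)  # dz/dx = 1 / sigma
    return log_base + log_det

# Toy "anomaly-free" training set (hypothetical values).
random.seed(0)
normal_data = [random.gauss(5.0, 0.5) for _ in range(1000)]
mu, sigma = fit_affine_flow(normal_data)

ll_normal = log_likelihood(5.1, mu, sigma)   # close to the training data
ll_anomaly = log_likelihood(9.0, mu, sigma)  # far from the training data
print(ll_normal, ll_anomaly)
```

The sample at 9.0 receives a much lower log-likelihood than the one at 5.1, which is the signal an NF-based detector thresholds on.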

State-of-the-art methods following this approach are centered on designing or trying out different multi-scale feature extractors. In this work, we further improve the approach by proposing a new feature extractor and a multi-level deep feature integration method that aggregates the information from different scales using the well-known and widely used UNet-like [41] architecture. The U-shape comprises the feature extractor as the encoder and the NF as the decoder.


The U-shaped NF is composed of a number of flow stages, each one corresponding to a different scale, whose size matches the extracted feature maps (see Figure 2). For each scale starting from the bottom, i.e., the coarsest scale, the input is fed into its corresponding flow stage, which is essentially a sequential concatenation of flow steps. The output of this flow stage is then split in such a way that half of the channels are sent directly to the output of the whole graph, and the other half is up-sampled and concatenated with the input of the next scale, which then proceeds in the same fashion. The up-sampling is also performed in an invertible way, as proposed in [43], where pixels in groups of four channels are reordered in such a way that the output volume has four times fewer channels and double width and height.
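The split and the invertible up-sampling described above can be sketched in a few lines. This is an illustrative sketch on nested Python lists, not the official implementation: the function names are hypothetical, and the specific ordering of the four channels inside each 2x2 block is an assumption (any fixed ordering is invertible).

```python
def split_channels(x):
    """Split a [C][H][W] tensor into two halves along the channel axis."""
    half = len(x) // 2
    return x[:half], x[half:]

def invertible_upsample(x):
    """Reorder groups of four channels into 2x2 spatial blocks.

    Input shape:  [4*c][H][W]
    Output shape: [c][2*H][2*W]
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    assert channels % 4 == 0, "channel count must be a multiple of 4"
    out = [[[0.0] * (2 * width) for _ in range(2 * height)]
           for _ in range(channels // 4)]
    for k in range(channels):
        c, offset = divmod(k, 4)    # output channel, position in the group
        di, dj = divmod(offset, 2)  # row/column offset inside the 2x2 block
        for i in range(height):
            for j in range(width):
                out[c][2 * i + di][2 * j + dj] = x[k][i][j]
    return out

def invertible_downsample(y):
    """Exact inverse of invertible_upsample."""
    channels, height, width = len(y), len(y[0]) // 2, len(y[0][0]) // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(4 * channels)]
    for c in range(channels):
        for di in range(2):
            for dj in range(2):
                k = 4 * c + 2 * di + dj
                for i in range(height):
                    for j in range(width):
                        out[k][i][j] = y[c][2 * i + di][2 * j + dj]
    return out

# Toy flow-stage output with 8 channels on a 2x3 grid.
stage_output = [[[float(100 * k + 10 * i + j) for j in range(3)]
                 for i in range(2)] for k in range(8)]
to_output, to_next_scale = split_channels(stage_output)
upsampled = invertible_upsample(to_next_scale)  # 4 channels -> 1 channel, 4x6
```

No value is created or discarded in either direction, so the operation has a unit Jacobian determinant and does not affect the likelihood computation.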

To sum up, the U-shaped NF produces L white Gaussian embeddings $z^1, \dots, z^L$, one for each scale $l$, with $z^l \in \mathbb{R}^{C_l \times H_l \times W_l}$. Here, we denote by (i, j) the spatial location and by k the channel index in the latent tensors. Their elements $z^l_{ijk} \sim \mathcal{N}(0, 1)$ are mutually independent for any position, channel, and scale i, j, k, l.

Phase 3: Likelihood-based Anomaly Score

This phase of the method is used at test time to compute the anomaly map. It takes as input the embeddings $z^1, \dots, z^L$ and produces an anomaly map based on the likelihood computation. Thanks to the statistical independence of the features produced by the U-shaped NF, the joint likelihood of a test image under the anomaly-free image model is the product of the standard normal marginals. Therefore, to build this map, we associate to each pixel (i,j) in the test image a likelihood-based Anomaly Score similar to those in [19, 35] (Equation (2) of the paper).
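As a worked step, the link between the independence property and a per-pixel score can be written out as follows. This is a sketch under stated assumptions: $C_l$ denotes the number of channels at scale $l$ and $s^l$ is a name introduced here for illustration. The exact normalization, and the way the scales are merged into a single map, should be taken from the paper.

```latex
% Independent standard normal elements at pixel (i, j) of scale l:
p\left(z^l_{ij\cdot}\right) = \prod_{k=1}^{C_l} \frac{1}{\sqrt{2\pi}}
    \exp\left(-\tfrac{1}{2}\left(z^l_{ijk}\right)^2\right)

% Taking the negative logarithm turns the product into a sum of squares:
-\log p\left(z^l_{ij\cdot}\right)
    = \frac{1}{2} \sum_{k=1}^{C_l} \left(z^l_{ijk}\right)^2
    + \frac{C_l}{2} \log(2\pi)

% A per-scale score that is monotone in the negative log-likelihood
% (the 1/C_l normalization is an assumption made for this sketch):
s^l(i, j) = \frac{1}{C_l} \sum_{k=1}^{C_l} \left(z^l_{ijk}\right)^2
```

Under this form, a low likelihood at a pixel corresponds to a high score, which matches the convention that greater scores are more anomalous.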


Phase 4: A contrario anomaly segmentation

The last phase of U-Flow is designed to obtain a segmentation of the anomalies, as this is usually a requirement in real-world applications. To do so, we employ the a contrario methodology and compute the Number of False Alarms (NFA) associated with each pixel in the image. This methodology makes it possible to derive an automatic threshold over the NFA, which also has a clear statistical meaning: it is an upper bound on the number of times that a certain structure can occur under normality conditions.

Anomalies can occur at any position in the image, with arbitrary shape and size, but evaluating all these possibilities in a reasonable time is infeasible. Therefore, to detect them efficiently, we designed a hierarchical tree-based algorithm, based on the upper level sets of the anomaly score defined in Phase 3. This tree is used to select the candidate regions, with their associated NFA value.

The details of this algorithm are presented in this post, and of course in the paper.


The proposed approach was tested and compared with state-of-the-art methods by conducting extensive experimentation on several datasets. Regarding the industrial inspection task, which is the motivation and focus of this work, we used MVTec-AD [23] (the most widely used dataset by the community) and BeanTech (BT) [24]. Besides testing the proposed method for this task, we also include experimentation with data from other fields to demonstrate the generalization capability of our method. To do so, we test and compare our method on datasets from the medical and surveillance fields: LGG-MRI (MRI) [25] and ShanghaiTech Campus (STC) [26], respectively.

For assessing the anomaly maps defined in (2), we adopt the AUROC metric (the area under the Receiver Operating Characteristic curve) at a pixel level, as it is the most widely used metric in the literature. For assessing the anomaly masks (after thresholding), we use the mIoU metric (Mean Intersection over Union).
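For readers less familiar with these two metrics, the following sketch shows one common way to compute them on flattened pixel lists. The function names are hypothetical, the AUROC uses the rank (Mann-Whitney) formulation, and averaging the IoU over images is an assumption about what "mean" refers to here. It is not the evaluation code used for the reported numbers.

```python
def iou(pred_mask, gt_mask):
    """Intersection over Union of two flat binary masks."""
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    return inter / union if union else 1.0

def mean_iou(mask_pairs):
    """Average IoU over (prediction, ground truth) pairs, one per image."""
    return sum(iou(p, g) for p, g in mask_pairs) / len(mask_pairs)

def auroc(scores, labels):
    """Pixel-level AUROC via ranks; needs both classes present in labels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # tied scores share the average rank
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Toy example: four pixels, two of them anomalous.
perfect = auroc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
chance = auroc([0.1, 0.9, 0.2, 0.8], [0, 0, 1, 1])
overlap = iou([1, 1, 0, 0], [0, 1, 1, 0])
```

AUROC needs no threshold because it only depends on how the scores rank anomalous pixels against normal ones, whereas IoU is computed on a thresholded mask.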

We also consider AUPRO (the area under the per-region-overlap curve) for the anomaly maps, since it ensures that both large and small anomalies are equally important. And even though the image-level anomaly detection task is not the primary focus of this work, we have included the image-level AUROC results. Both results can be found in the original paper, but are excluded from this article.

Table 1: MVTec-AD results: pixel-level AUROC. Comparison with the best performing methods: Patch SVDD [27], SPADE [29], PatchCore [31], PEFM [62], FastFlow [19], CFlow [35], and CS-Flow [20]. Our method outperforms all previous methods, with an average value of 98.74%.

For MVTec, the results obtained for the anomaly maps assessment in terms of AUROC are presented in Table 1. U-Flow achieves state-of-the-art results, outperforming all previous methods on average for AUROC. In addition, besides obtaining excellent results for AUROC (and AUPRO) in the anomaly localization task, our method presents another significant advantage with respect to all others: it produces a fully unsupervised segmentation of the anomalies and significantly outperforms its competitors, as shown in Table 2.

Table 2: Segmentation mIoU comparison for MVTec-AD, with the best flow-based methods in the literature: FastFlow [19], CFlow [35], and CS-Flow [20], for the oracle-like and fair thresholds defined in Section 4.1.1. Our method largely outperforms all others, and even exhibits a better performance comparing the proposed automatic threshold with their oracle-like threshold. 

We report results on anomaly segmentation based on an unsupervised threshold obtained by setting NFA = 1 (log(NFA) = 0). This threshold means that, in theory, we authorize, on average, at most, one false anomaly detection per image.
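To illustrate what thresholding at NFA = 1 means, here is a deliberately simplified, pixel-wise sketch. It is not the tree-based region algorithm of U-Flow. It assumes that the score of a pixel is the sum of squares of `dof` independent standard normal values (so it follows a chi-squared law under normality), that `dof` is even so the tail probability has a closed form, and that the number of tests equals the number of pixels. All names are hypothetical.

```python
import math

def chi2_sf_even(x, dof):
    """Tail probability P(X >= x) of a chi-squared variable with even dof."""
    assert dof % 2 == 0 and dof > 0, "closed form below needs an even dof"
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j  # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

def log_nfa(score, dof, num_tests):
    """log of NFA = num_tests * P(score >= observed) under normality."""
    tail = chi2_sf_even(score, dof)
    return math.log(num_tests) + math.log(max(tail, 1e-300))

# Toy 4x4 map of sum-of-squares scores with dof = 4 (hypothetical values).
dof = 4
score_map = [
    [3.0, 4.5, 2.0, 5.0],
    [4.0, 40.0, 3.5, 1.0],
    [2.5, 6.0, 4.0, 3.0],
    [5.5, 2.0, 3.0, 4.0],
]
num_tests = sum(len(row) for row in score_map)

# A pixel is detected when its NFA is below 1, i.e. log(NFA) < 0.
mask = [[log_nfa(s, dof, num_tests) < 0 for s in row] for row in score_map]
```

Multiplying the tail probability by the number of tests is what bounds the expected number of false detections per image, so the threshold does not need to be tuned per dataset.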

As the state-of-the-art methods to which we compare do not provide detection thresholds, we adopt two strategies: (i) we compute an oracle-like threshold that maximizes the mIoU for the testing set, and (ii) we use a fair strategy that only uses training data to find the threshold. In the latter, the threshold is set to allow at most one false positive in each training image, as this is analogous to setting NFA = 1 false alarm. As seen in Table 2, our automatic thresholding strategy significantly outperforms all others, even when compared with their oracle-like threshold.
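The fair strategy can be sketched as follows. This is one possible reading, labeled as an assumption: "one false positive" is taken to mean one pixel, and a pixel is detected when its score is strictly greater than the threshold. The function name and toy score maps are hypothetical.

```python
def fair_threshold(training_score_maps):
    """Smallest threshold leaving at most one pixel above it per training image.

    Training images are anomaly-free, so every pixel above the threshold is a
    false positive. Using the second-largest score of each image as a lower
    bound guarantees that at most the single largest pixel exceeds it.
    """
    threshold = float("-inf")
    for score_map in training_score_maps:
        flat = sorted((s for row in score_map for s in row), reverse=True)
        second_largest = flat[1] if len(flat) > 1 else flat[0]
        threshold = max(threshold, second_largest)
    return threshold

def count_detections(score_map, threshold):
    """Number of pixels with a score strictly above the threshold."""
    return sum(1 for row in score_map for s in row if s > threshold)

# Two toy anomaly-free training score maps (hypothetical values).
train_maps = [
    [[0.1, 0.5], [0.3, 0.2]],
    [[0.9, 0.4], [0.6, 0.1]],
]
t = fair_threshold(train_maps)
false_positives = [count_detections(m, t) for m in train_maps]
```

Unlike the oracle-like threshold, this one never looks at the test set, which is why it is the fairer point of comparison for an unsupervised method.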

Visual results

Example results for all MVTec categories. The first row shows the example images with the ground truth superimposed in red. The results for FastFlow, CFlow, and CS-Flow are shown in the second, third, and fourth rows. The last two rows correspond to our method: the anomaly score defined in (2), and the segmentation obtained with the automatic threshold log(NFA) < 0. While other methods achieve very good performance, in some cases they present artifacts and over-estimated anomaly scores. Our anomaly score achieves very good visual and numerical results, spotting anomalies with high confidence. Finally, the segmentation with the automatic threshold on the NFA is also able to spot and segment the anomaly accurately.

Normal image examples for all MVTec-AD categories. As can be seen, we always predict low values in the anomaly maps, and no detections are made. 

Results on other datasets

As mentioned above, we also tested our method on other datasets, including some from fields other than the industrial setting. The results on these datasets are also excellent, both for the generated anomaly map and for the automatic segmentation.

The results reported here demonstrate the robustness and generalization capability of the proposed approach. For all datasets, we obtained excellent results, reaching top performance for almost all metrics and datasets.

Results for LGG MRI [25], ShanghaiTech Campus (STC) [26], and BeanTech (BT) [24] datasets. Comparison with best-performing flow-based models: FastFlow [19], CFlow [35], and CS-Flow [20]. Our method obtains the best results for almost all metrics and datasets, outperforming the other methods. Both AUROC and AUPRO refer to the pixel-level metric (localization task). For mIoU, we present the results using both the “Fair” and the “Oracle”-like thresholds.

Examples of typical results on BeanTech, LGG MRI, and STC datasets. The first row shows the original images with superimposed ground truth. The second, third, and fourth rows show the results of FastFlow, CFlow, and CS-Flow for comparison. The last two rows are the outputs of our method: the anomaly score defined in (2), and the anomaly segmentation with log(NFA) < 0. Again, our anomaly score achieves very good visual and numerical results and is able to detect the anomalies with great confidence.

Normal image examples of BeanTech, LGG MRI, and STC.


[14] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015).

[19] Yu, J., Zheng, Y., Wang, X., Li, W., Wu, Y., Zhao, R., Wu, L.: Fastflow: unsupervised anomaly detection and localization via 2d normalizing flows. arXiv preprint arXiv:2111.07677 (2021)

[20] Rudolph, M., Wehrbein, T., Rosenhahn, B., Wandt, B.: Fully convolutional cross-scale-flows for image-based defect detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1088–1097 (2022)

[21] Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 32–42 (2021)

[23] Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: Mvtec ad—a comprehensive real-world dataset for unsupervised anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9592–9600 (2019)

[24] Mishra, P., Verk, R., Fornasier, D., Piciarelli, C., Foresti, G.L.: VT-ADL: A vision transformer network for image anomaly detection and localization. In: 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), pp. 01–06 (2021)

[25] Buda, M., Saha, A., Mazurowski, M.A.: Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm. Comput. Biol. Med. 109, 218–225 (2019)

[26] Liu, W., Luo, W., Lian, D., Gao, S.: Future frame prediction for anomaly detection - a new baseline. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

[27] Yi, J., Yoon, S.: Patch svdd: Patch-level svdd for anomaly detection and segmentation. In: Proceedings of the Asian Conference on Computer Vision (2020)

[29] Cohen, N., Hoshen, Y.: Sub-image anomaly detection with deep pyramid correspondences. arXiv preprint arXiv:2005.02357 (2020)

[31] Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P.: Towards total recall in industrial anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14318–14328 (2022)

[35] Gudovskiy, D., Ishizaka, S., Kozuka, K.: Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 98–107 (2022)

[36] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

[37] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

[38] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

[41] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 234–241. Springer (2015)

[42] Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: International Conference on Machine Learning, pp. 1530–1538 (2015)

[43] Jacobsen, J.-H., Smeulders, A., Oyallon, E.: i-revnet: Deep invertible networks. arXiv preprint arXiv:1802.07088 (2018)

[62] Wan, Q., Cao, Y., Gao, L., Shen, W., Li, X.: Position encoding enhanced feature mapping for image anomaly detection. In: 2022 IEEE 18th International Conference on Automation Science and Engineering (CASE), pp. 876–881 (2022)