FACULTY OF SCIENCE AND TECHNOLOGY
DEPARTMENT OF COMPUTING

Maksims Ivanovs

DEEP LEARNING FOR APPLIED COMPUTER VISION: SOLVING IMAGE UNDERSTANDING TASKS WITH CONVOLUTIONAL NEURAL NETWORKS

DOCTORAL THESIS

Field: Computer Science
Subfield: Artificial Intelligence
Thesis Advisor: Senior Researcher, Dr. sc. ing. Roberts Kadiķis

Riga 2025

Abstract

This PhD thesis focuses on the application of deep learning methods to solving key image understanding tasks: image classification, object detection, and semantic segmentation. In the literature review, I provide the background for the experimental work presented in the rest of the thesis and establish that the most promising type of deep neural architecture for image understanding tasks is convolutional neural networks (CNNs). I also discuss the challenges related to the availability of data for training CNNs and outline how transfer learning and synthetic data can be leveraged as potential solutions to mitigate this issue. The practical part of the thesis focuses on various applications of CNNs to real-world image understanding problems. In Chapter 2, I describe the use of CNNs for recognising hand-washing movements, with the goal of designing a system to monitor hand hygiene. This work led to the collection of the largest dataset of labelled hand-washing videos to date; subsequently, lightweight CNN models were trained on it. Although the accuracy of these models was not yet sufficient for implementing a real-world monitoring system, the experimental findings represent a significant step toward that goal. Additionally, the experiments with CNNs reported in Chapter 2 highlight the differences between simplified datasets and complex, real-world datasets. In Chapter 3, I report the study on the use of CNNs for semantic segmentation of street views, an essential task for the perceptual module of self-driving cars.
The results indicate that augmenting real-world datasets with photorealistic synthetic images can significantly improve the accuracy of semantic segmentation, as the best MobileNetV2 and Xception-65 models trained on the augmented data outperformed their counterparts trained solely on real-world images. In Chapter 4, I present the use of CNN-based object detectors for identifying plastic bottles suitable for being picked up by a robotic arm. The object detectors were trained on synthetic images generated using Blender; by enhancing these images with Generative Adversarial Networks (GANs), the mean average precision of object detection improved compared to the models trained on the original synthetic images. In Chapter 5, I describe the use of CNNs to classify images of cell tissue with the goal of automating the growth of organs-on-a-chip (OOC). The accuracy of the EfficientNet-B7 and MobileNetV3Large classifiers was improved by training them on datasets augmented with synthetic data generated with the Stable Diffusion large generative model. The goal of the thesis – to provide efficient solutions for applied image understanding tasks – was achieved for all tasks except the classification experiments on the PSKUS dataset, the most complex and noisy dataset of hand-washing videos. Additionally, the findings contributed to a better understanding of some methodological challenges in deep learning, such as augmenting real-world datasets with synthetic data. The results reported in the thesis have been published in six scientific articles indexed in Elsevier Scopus and/or Web of Science databases and presented at four conferences. The approbation of the results was conducted in seven research projects at the Institute of Electronics and Computer Science (EDI), where this thesis was developed. The results support the four thesis statements that I propose for the defence.

To my family.
Acknowledgements

I would like to express my deepest gratitude to a number of people whose support and encouragement have been pivotal in bringing this thesis to completion.

First and foremost, I would like to thank my thesis advisor, Dr. Roberts Kadiķis, for his invaluable guidance, expertise, and patience. His insights and advice have been essential in shaping my research career in general and this thesis in particular.

I am immensely grateful to my family for their unwavering support throughout this journey: to my wife Ilze, for her love, patience and companionship; to my daughters Laura, Alise, Adriāna, Emīlija, and Karolīna, who fill my life with joy and inspiration; to my mother Ņina, for her belief in my success.

A special thank you goes to my colleagues at the Institute of Electronics and Computer Science (EDI), especially Dr. Modris Greitāns and Dr. Kaspars Ozols for creating a supportive research environment, and Dr. Atis Elsts, Dr. Jānis Judvaitis, and fellow PhD candidates Krišjānis Nesenbergs, Didzis Lapsa, and Anatolijs Zencovs for the opportunities to engage in stimulating discussions.

I am very grateful to all the co-authors of my publications, with whom I have had the pleasure and honour to collaborate. Their contributions have been instrumental in my research.

I would also like to acknowledge my colleagues at the University of Latvia (UL), especially Professor Juris Borzovs, Professor Zane Bičevska, Professor Jānis Zuters, Professor Laila Niedrīte, and Professor Guntis Arnicāns, for providing me with the opportunities to teach at UL and invaluable advice and kind help in various academic and administrative matters.

I wish to thank the administrative staff at the UL, particularly Ārija Sproģe, Dace Mileika, and the late Anita Ermuša, for their kind assistance in various organisational matters.
I am also very grateful to Professor Jānis Zuters and Associate Professor Edgars Celms for being examiners at my PhD exam, to Professor Uldis Straujums, Professor Guntis Arnicāns, and Professor Juris Borzovs for providing valuable feedback during the public discussion of an earlier version of this thesis, and to Associate Professor Jevgēnijs Vihrovs for kindly sharing with me the LaTeX template of his PhD thesis, thus saving me a lot of time and effort on formatting this work.

I extend my heartfelt thanks to my students at the University of Latvia, who have greatly contributed to my development as an educator, thus also making me a better researcher.

Last but certainly not least, I wish to thank my dogs Pērle, Kara (oh, sweet little Kara!), Lesija, and Bella, and my pet parrots Rons, Lira, Solo, Frodo, and the late Taira for their companionship and the joy they bring into my life. They have been a source of comfort and relief for me during the most stressful times.

This academic journey has been challenging, rewarding, and at times, challenging once more, and I am thankful to have had such a supportive network of family, colleagues, students, and friends. Once again, thank you all!

Contents

List of Abbreviations ..... vii
Introduction ..... 1
1 Background ..... 8
  1.1 Computer vision: definition, scope, and highlights ..... 8
  1.2 Image understanding: definition, main tasks, metrics ..... 12
  1.3 Methods for solving image understanding tasks ..... 17
    1.3.1 Classical methods ..... 18
    1.3.2 Deep learning-based methods ..... 20
  1.4 Datasets for image understanding tasks ..... 42
    1.4.1 The fundamental role of the data ..... 42
    1.4.2 Transfer learning and fine-tuning ..... 43
    1.4.3 Data augmentation ..... 44
    1.4.4 Synthetic data ..... 45
  1.5 Concluding remarks ..... 46
2 Hand-Washing Movement Classification ..... 48
  2.1 Introduction ..... 49
  2.2 Related work: methods for monitoring hand-washing ..... 51
  2.3 Hand-washing recording datasets ..... 54
    2.3.1 PSKUS dataset ..... 54
    2.3.2 METC dataset ..... 59
  2.4 Initial experiments on PSKUS and METC datasets ..... 62
  2.5 Cross-dataset study of CNN performance ..... 66
    2.5.1 Datasets and data preprocessing ..... 67
    2.5.2 Experiments: methodology and results ..... 68
  2.6 Concluding remarks ..... 74
3 Semantic Segmentation of Street Views ..... 76
  3.1 Introduction ..... 76
  3.2 Related work: datasets and methods for semantic segmentation of street views ..... 78
  3.3 Street views datasets for semantic segmentation: Cityscapes, MICC-SRI, and CCM ..... 80
  3.4 Data preprocessing ..... 82
  3.5 Experiments: methodology and results ..... 84
    3.5.1 Methodology ..... 84
    3.5.2 Results ..... 85
  3.6 Concluding remarks ..... 90
4 Object Detection for a Bin-Picking Task ..... 92
  4.1 Introduction ..... 92
  4.2 Related work: object detection and sim-to-real translation for robotics ..... 94
  4.3 Initial datasets ..... 95
    4.3.1 Real-world dataset ..... 96
    4.3.2 Synthetic dataset ..... 96
  4.4 Experiments: methodology and results ..... 97
    4.4.1 Sim-to-real transfer with CycleGAN ..... 97
    4.4.2 Evaluation of datasets generated with CycleGAN with FID score ..... 100
    4.4.3 Object detection experiments ..... 102
  4.5 Concluding remarks ..... 104
5 Image Classification for Monitoring the Growth of Organs-on-a-Chip ..... 105
  5.1 Introduction ..... 105
  5.2 Related work ..... 108
    5.2.1 Synthetic data for training CNN models for biomedical image understanding tasks ..... 108
    5.2.2 Large generative models for image synthesis ..... 108
    5.2.3 Stable Diffusion model for image generation ..... 109
  5.3 Experiments on the initial OOC image dataset ..... 111
    5.3.1 The initial OOC image dataset ..... 111
    5.3.2 Synthetic data for augmenting the initial OOC image dataset ..... 112
    5.3.3 Experiments with CNNs on the initial dataset ..... 113
  5.4 Experiments on the final OOC image dataset ..... 114
    5.4.1 The final OOC image dataset ..... 115
    5.4.2 Synthetic data for augmenting the final OOC image dataset ..... 116
    5.4.3 Experiments with CNNs on the final dataset ..... 120
  5.5 Concluding remarks ..... 123
Conclusion ..... 125
Bibliography ..... 131

List of Abbreviations

A549  Human lung adenocarcinoma alveolar basal epithelial cell line
AGI  Artificial general intelligence
AI  Artificial intelligence
ANN  Artificial neural network
AP  Average precision
AR  Average recall
ASPP  Atrous spatial pyramid pooling
BERT  Bidirectional encoder representations from transformers (DNN model)
Caco-2  Colorectal adenocarcinoma epithelial cell line
CamVid  The Cambridge-driving Labeled Video Database
CARLA  Car Learning to Act (driving simulator)
CCM  Cityscapes-CARLA Mixed (dataset)
CFG  Classifier-free guidance
CIoU  Complete Intersection over Union
CLIP  Contrastive Language-Image Pre-training
CmBN  Cross mini-Batch Normalization
CNN  Convolutional neural network
CPU  Central processing unit
CRF  Conditional Random Field
CSPD  Cross-stage partial connections
CycleGAN  Cycle-Consistent Generative Adversarial Network
DIoU-NMS  Distance IoU Non-Maximum Suppression
DL  Deep learning
DNN  Deep neural network
DPM  Deformable Parts Model
DPM++ 2M  Diffusion Probabilistic Model Second-Order Multistep Improved [Sampler]
ECDC  European Centre for Disease Prevention and Control
EDI  Institute of Electronics and Computer Science
FID  Fréchet Inception Distance
FLOP  Floating-point operation
FPN  Feature Pyramid Network
FPS  Frames per second
GAN  Generative Adversarial Network
GAP  Global average pooling
GPU  Graphics processing unit
GRU  Gated recurrent unit
GTA  Grand Theft Auto (video game)
HPC  High-performance computing
HPMEC  Human pulmonary microvascular endothelial cell line
HSAEC  Human small airway epithelial cell line
HSV  Hue, saturation and value
h-swish  Hard-swish
HUVEC  Human umbilical vein endothelial cell line
ILSVRC  ImageNet Large Scale Visual Recognition Challenge
IoT  Internet-of-Things
IoU  Intersection over union
kNN  K-Nearest Neighbours
LDM  Latent diffusion model
LoRA  Low-rank adaptation
LSTM  Long short-term memory
mAP  Mean average precision
MDR  Multidrug-resistant [bacteria]
METC  Medical Education Technology Centre
MICC-SRI  Media Integration and Communication Center – Semantic Road Inpainting (dataset)
mIoU  Mean intersection over union
MiWRC  Multi-input weighted residual connections
ML  Machine learning
MOT  Moving objects tracking
MRI  Magnetic resonance image
MS COCO  Microsoft Common Objects in Context (dataset)
NAS  Neural architecture search
NHBE  Normal human bronchial epithelial cell line
NLP  Natural language processing
OOC  Organ-on-a-chip
OS  Operating system
PA  Pixel accuracy
PAN  Path Aggregation Network
PSKUS  Pauls Stradins Clinical University Hospital
R-CNN  Regions with CNN (DNN model)
RGB  Red, green, and blue
RL  Reinforcement learning
ReLU  Rectified linear unit
SiLU  Sigmoid linear unit
SPP  Spatial pyramid pooling
SPPF  Spatial Pyramid Pooling Fusion
SSD  Single Shot Detector (DNN model)
SVM  Support Vector Machine
SYNTHIA  Synthetic collection of Imagery and Annotations (dataset)
TORCS  The Open Racing Car Simulator
TSD  Traffic Signalisation Detection
VAE  Variational Autoencoder
VGG  Visual Geometry Group
WHO  World Health Organisation
YOLO  You Only Look Once (DNN model)

Introduction

From time immemorial, there has been a dream about artificial aides to humans, capable of performing various physical and intellectual tasks on their own. Creations of that kind were already mentioned in the earliest works of European literature: in particular, Nilsson [1] considers 'attendants' that, according to Homer's Iliad, were crafted by the blacksmith god Hephaestus to help him get around, to be one of the earliest fictional artificial intelligence (AI) agents.
Folklore and literature abound in other characters brought to life from inanimate matter by human ingenuity, from Galatea to the Golem to Čapek's robots, yet it was not until the advent of modern science and technology that the dreams about AI began to become more tangible. Advances in neurobiological research on brain structure and function and in psychological investigations of intelligence, the development of relevant mathematical methods, research on the theory of computation, and, finally, the invention of the first computers have all contributed to the inception of the AI revolution around the middle of the twentieth century. Since then, AI methods have progressed from the very simple artificial models of neurons [2] to the development of large language models such as GPT-4 [3], capable of assisting humans in a variety of tasks, from debugging code to composing poetry. However, the level of human-like artificial general intelligence (AGI) has not been reached yet, and the pace of AI progress varies significantly across different fields.

As an example of these disparities, let us compare and contrast the achievements of AI in chess and in computer vision tasks. In chess, which has long held the 'traditional status [...] [of] an exemplary demonstration of human intellect' [4], AI-based engines gained the ultimate superiority over the best human players more than two decades ago and have maintained that status since then, demonstrating that computers are extremely good at solving well-defined rule-based tasks. The exemplary opposite case is computer vision: since even simple organisms are capable of perceiving their visual environment and navigating in it, it may appear at first glance (pun intended) that developing human-level computer vision should not pose too great a challenge compared to other areas of AI research.
As Szeliski observes in his overview of the history of computer vision [5], this belief was shared by at least some early pioneers of AI and robotics: in particular, according to a story that has been a part of AI folklore for a long time now, in 1966, Marvin Minsky asked Gerald Jay Sussman, his undergraduate student at MIT at that time, to 'spend the summer linking a camera to a computer and getting the computer to describe what it saw' [6]. However, it turned out that teaching computers to see and understand what they see is a somewhat more complicated task than a summer project for a group of undergraduates, and more than half a century later, research on computer vision in general and image understanding (i.e., 'getting the computer to describe what it saw') in particular still presents many unresolved challenges and exciting tasks. It is also one of the most active research areas in computer science today, as there are many possible applications for automated systems with human or even superhuman capacity for understanding visual scenes.

In this thesis, I¹ am concerned with the three general kinds of image understanding problems, namely:

• image classification: categorising images into one of several predefined classes;
• image segmentation: segmenting an image into different objects and background areas by categorising each of its pixels into one of several predefined classes [7];
• object detection: detecting instances of objects in images and categorising each instance into one of several predefined classes [8].

My work on these problems belongs to the domain of applied computer vision, meaning the application of computer vision methods to practical tasks in science and industry.
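To make the distinction between the three tasks concrete, they differ above all in the shape of their outputs: one label per image, one label per pixel, or a set of labelled boxes. The following is a minimal illustrative NumPy sketch of these output formats (dummy values on a hypothetical 4-class problem, not code from the thesis experiments):

```python
import numpy as np

# A hypothetical 4-class problem on a 6x8-pixel image; shapes only, no real model.
n_classes, h, w = 4, 6, 8

# Image classification: one probability distribution per image.
class_probs = np.full(n_classes, 1.0 / n_classes)
predicted_label = int(np.argmax(class_probs))

# Semantic segmentation: one class label per pixel.
seg_mask = np.zeros((h, w), dtype=np.int64)  # here every pixel gets class 0

# Object detection: a variable-length set of (x1, y1, x2, y2, class, confidence).
detections = [(1.0, 1.0, 4.0, 5.0, 2, 0.9)]

print(class_probs.shape, seg_mask.shape, len(detections))
```

Metrics follow the same split: accuracy for classification, per-pixel measures such as mIoU for segmentation, and mAP over box sets for detection.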
In particular, I use computer vision methods to solve the following real-world image understanding tasks:

• to classify hand-washing movements in videos filmed in clinical settings for monitoring hand hygiene (Chapter 2);
• to perform semantic segmentation of street views for improving the navigation systems of self-driving cars (Chapter 3);
• to detect graspable bottles in a pile for the bin-picking task carried out by a robotic arm (Chapter 4);
• to classify microscopy images for automated monitoring of growing organs-on-a-chip (OOC) (Chapter 5).

Furthermore, from a broader perspective, my work is also concerned with several general methodological problems in computer vision, that is, problems that pertain not only to the specific task that I solve in a particular case, but to computer vision research in general. Thus, in Chapter 2, I highlight the need for large datasets when training state-of-the-art image classifiers. When the task is rather specific – such as classifying hand-washing movements – the number of publicly available image datasets is often rather small, and even those may be only partially available or originally labelled in a way that requires relabelling for the task. As a result, in such circumstances, one needs to acquire and label the data first, which is time- and effort-consuming and therefore makes alternative approaches, such as the use of synthetic data discussed in the following, particularly topical. Furthermore, as I point out in Chapter 2 when characterising the datasets collected for hand-washing movement classification, annotating videos with such continuous and rather complex movements as washing hands inevitably involves discrepancies between the different human annotators of the data, which should be accounted for and resolved in a methodologically sound way.
In Chapters 3, 4, and 5, I address the problem of the availability (or lack thereof) of real-world data for training AI models by resorting to the use of synthetic (i.e., artificially generated) data to augment real-world datasets. The use of synthetic data has become increasingly popular in recent years, not only in computer vision but also in many other AI domains (see [9] for a comprehensive overview) such as natural language processing (NLP) and reinforcement learning (RL), as it reduces the cost, effort, and tedium of collecting and labelling real-world data. On the other hand, it also necessitates dealing with the gap between real-world and synthetic data [10], which, in the case of synthetic images, is primarily caused by their being less photorealistic than real-world images. I address the problem of using synthetic data to augment real-world datasets in several ways. Thus, in Chapters 3 and 5, I investigate how the proportion of synthetic data used to augment real-world image datasets affects the accuracy of semantic segmentation and image classification, respectively, that is, whether using more synthetic data always improves performance.

¹ The conventional way to refer to oneself in a thesis is to use the third person singular ('the author') or the first person plural ('we'). However, in this thesis, I avoid the former, as I find it unnecessarily convoluted (pun intended); as for 'we', I typically use it when referring to myself and the reader (as in 'as we can see') or when discussing collaborative efforts. Some examples of using such a collaborative 'we' can be found when I describe acquiring and labelling videos of hand-washing episodes in Chapter 2 or images of cell culture in Chapter 5. Otherwise, I prefer the bold, honest and straightforward 'I', as that allows me to underscore that my opinion is my opinion, the work that I did is the work that I did, and my mistakes and omissions are, well, my mistakes and omissions.
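Such proportion experiments amount to constructing training sets with a controlled real-to-synthetic ratio and comparing the resulting models. The sketch below is a hypothetical helper for building such mixtures (the function name, ratios, and sampling scheme are illustrative, not the actual experimental protocol of Chapters 3 and 5):

```python
import random

def mix_datasets(real, synthetic, synth_fraction, seed=0):
    """Build a shuffled training list in which synth_fraction of the items
    are synthetic; the real-world data is always used in full."""
    assert 0.0 <= synth_fraction < 1.0
    rng = random.Random(seed)  # fixed seed for reproducible experiments
    # Choose n_synth so that n_synth / (n_real + n_synth) == synth_fraction.
    n_synth = round(len(real) * synth_fraction / (1.0 - synth_fraction))
    mixed = list(real) + rng.sample(synthetic, min(n_synth, len(synthetic)))
    rng.shuffle(mixed)
    return mixed

# Hypothetical usage: 800 real images augmented to a 20% synthetic share.
train = mix_datasets([f"real_{i}" for i in range(800)],
                     [f"synt_{i}" for i in range(2000)],
                     synth_fraction=0.2)
print(len(train))  # 800 real + 200 synthetic = 1000 items
```

Sweeping synth_fraction over a grid and retraining at each point is what makes it possible to ask whether more synthetic data always helps.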
I also explore various approaches to generating synthetic data: in the research reported in Chapter 3, I generate images of street views and their segmentation masks with the out-of-the-box open-source driving simulator Car Learning to Act (CARLA, [11]); in Chapter 4, I describe how Generative Adversarial Networks (GANs, [12]), potentially powerful but also not-so-easy-to-train and often fragile AI models, can be used to generate synthetic images of plastic bottles; in Chapter 5, I report my work on generating synthetic images of cell culture with the powerful generative AI model Stable Diffusion [13].

In all studies presented in this thesis, I address image understanding challenges by means of deep learning (DL; [14]), that is, using deep neural networks (DNNs), which are complex ensembles of artificial neurons – abstract units capable of jointly learning underlying data representations. In the early days of computer vision research, image classification, image segmentation, and object detection problems were dealt with by means of traditional (or classical) computer vision methods [15], which relied on feature extraction and the subsequent use of classical machine learning algorithms such as Support Vector Machines (SVM; [16]) and k-Nearest Neighbours (kNN; [17]). However, since 2012, when the Convolutional Neural Network (CNN) AlexNet [18] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, outstripping its competitors – traditional computer vision models – by a large margin on the ImageNet [19] multiclass dataset, the popularity of DNNs has increased steadily, and they are currently regarded as state-of-the-art for most computer vision tasks.
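As a reminder of the operation from which CNNs take their name, a convolutional layer slides a small learned kernel over the image and computes a weighted sum at each position, replacing the hand-crafted feature extraction of the classical pipeline with learned filters. A minimal illustrative NumPy sketch of a single 2-D convolution (strictly, the cross-correlation that DL frameworks implement; a fixed Sobel-style edge kernel stands in for a learned one):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation of an image with a kernel,
    i.e. the per-channel operation a CNN layer performs."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Weighted sum of the kernel-sized window at (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel applied to a step image responds only at the edge.
image = np.zeros((5, 6))
image[:, 3:] = 1.0  # dark left half, bright right half
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
response = conv2d(image, sobel_x)
print(response.shape)  # (3, 4): 'valid' output of a 3x3 kernel on 5x6 input
```

A CNN stacks many such filters, learned from data rather than designed by hand, interleaved with nonlinearities and pooling.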
Since DNNs have also been successfully applied in other AI domains such as time series analysis, NLP, and content generation, research on general methodological problems in deep learning – such as the already mentioned need for large datasets for training models, quality control issues in large-scale datasets, and the use of synthetic data for augmenting real-world datasets – can potentially contribute to advancements in these fields as well.

The goal of this thesis is to provide efficient solutions for applied image understanding tasks. The central premise of the thesis is that convolutional neural networks can successfully solve the image understanding tasks considered in this work. The research objectives and hypotheses depend on the specific task and are therefore defined in the chapters of the thesis reporting the respective research. The research methods that I use in this thesis are those commonly employed in AI and computer vision research: exploration and analysis of relevant literature, data cleaning and preprocessing, synthetic data generation, design and implementation of experiments involving deep neural networks, and analysis and validation of the results of experiments.

As a result of the work presented in this thesis, I propose the following thesis statements for defence:

• Thesis statement one: In applications of CNNs to real-world image understanding tasks, data availability and quality present greater challenges than model selection and customisation.
• Thesis statement two: CNNs that perform well when trained and evaluated on datasets acquired in laboratory conditions may struggle to achieve similar success when trained and evaluated on more complex real-world data.
• Thesis statement three: While state-of-the-art CNN-based image classifiers and object detectors with a larger number of parameters typically demonstrate higher accuracy on benchmark datasets than their counterparts with a smaller number of parameters, this accuracy gap narrows or even vanishes when these models are trained and evaluated on smaller, more complex real-world datasets.
• Thesis statement four: While augmenting real-world datasets with photorealistic synthetic images is an efficient way to improve the accuracy of CNNs trained on such data, increasing the amount of synthetic data does not directly correlate with improved accuracy on image understanding tasks.

The above thesis statements are primarily grounded in the results found in the following chapters: thesis statement one – in Chapters 2, 3, 4, and 5; thesis statement two – in Chapter 2; thesis statement three – in Chapters 4 and 5; thesis statement four – in Chapters 3 and 5.

Research findings reported in this thesis have been published in the following scholarly articles indexed in Elsevier Scopus and/or Web of Science databases:

1. M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Gromova, M. Ivanovs, A. Skadins, R. Kadikis, and A. Elsts, "Hand-washing video dataset annotated according to the World Health Organization's hand-washing guidelines," Data, vol. 6, no. 4:38, 2021. The length of the publication: 6 pages. My approximate contribution: 10–15%.

2. A. Elsts, M. Ivanovs, R. Kadikis, and O. Sabelnikovs, "CNN for hand washing movement classification: What matters more – the approach or the dataset?," in 2022 Eleventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–6, IEEE, 2022. The length of the publication: 6 pages. My approximate contribution: 30–40%.

3. M. Ivanovs, K. Ozols, A. Dobrajs, and R. Kadikis, "Improving semantic segmentation of urban scenes for self-driving cars with synthetic images," Sensors, vol. 22, no. 6:2252, 2022.
The length of the publication: 13 pages. My approximate contribution: 90%.

4. D. Duplevska, M. Ivanovs, J. Arents, and R. Kadikis, "Sim2Real image translation to improve a synthetic dataset for a bin picking task," in 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), pp. 1–7, IEEE, 2022. The length of the publication: 7 pages. My approximate contribution: 30–40%.

5. M. Ivanovs, L. Leja, K. Zviedris, R. Rimsa, K. Narbute, V. Movcana, F. Rumnieks, A. Strods, K. Gillois, G. Mozolevskis, A. Abols, and R. Kadikis, "Synthetic image generation with a fine-tuned latent diffusion model for organ on chip cell image classification," in 2023 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pp. 1–6, IEEE, 2023. The length of the publication: 6 pages. My approximate contribution: 90%.

6. V. Movčana, A. Strods, K. Narbute, F. Rūmnieks, R. Rimša, G. Mozoļevskis, M. Ivanovs, R. Kadikis, K. Zviedris, L. Leja, A. Zujeva, T. Laimiņa, and A. Abols, "Organ-On-A-Chip (OOC) Image Dataset for Machine Learning and Tissue Model Evaluation," Data, vol. 9, no. 2:28, 2024. The length of the publication: 10 pages. My approximate contribution: 15–20%.

Research findings included in this thesis have also been reported in the following scholarly publications that are not indexed in Elsevier Scopus or Web of Science databases:

1. M. Ivanovs, R. Kadikis, M. Lulla, A. Rutkovskis, and A. Elsts, "Automated quality assessment of hand washing using deep learning," arXiv:2011.11383, 2020. The length of the publication: 8 pages. My approximate contribution: ≥ 50%.

2. O. Zemlanuhina, M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Melbarde-Kelmere, A. Elsts, M. Ivanov², and O. Sabelnikovs, "Influence of different types of real-time feedback on hand washing quality assessed with neural networks/simulated neural networks," in SHS Web of Conferences, vol. 131, p. 02008, EDP Sciences, 2022.
The length of the publication: 13 pages. My approximate contribution: 10%.

² Note the misspelled surname: Ivanov rather than Ivanovs.

I have presented research findings reported in this thesis at the following conferences:

1. IEEE International Conference on Image Processing Theory, Tools & Applications – IPTA 2022, Salzburg, Austria, 2022. Presentation: CNN for Hand Washing Movement Classification: What Matters More – the Approach or the Dataset?

2. 27th IEEE International Conference on Emerging Technologies and Factory Automation – ETFA 2022, Stuttgart, Germany. Presentation: Sim2Real Image Translation to Improve a Synthetic Dataset for a Bin Picking Task.

3. SEMICON Europa 2022 Conference, Munich, Germany, 2022. Presentation: Synthetic Data for Robotics: Opportunities and Challenges.

4. 26th IEEE Signal Processing: Algorithms, Architectures, Arrangements and Applications Conference – SPA 2023, Poznan, Poland, 2023. Presentation: Synthetic Image Generation With a Fine-Tuned Latent Diffusion Model for Organ on Chip Cell Image Classification.

In addition to the above publications, during my doctoral studies, I have also contributed to the following publications that are not included in this thesis:

1. J. Judvaitis, A. Mednis, V. Abolins, A. Skadins, D. Lapsa, R. Rava, M. Ivanovs, and K. Nesenbergs, "Classification of actual sensor network deployments in research studies from 2013 to 2017," Data, vol. 5, no. 4:93, 2020.

2. A. Skadins, M. Ivanovs, R. Rava, and K. Nesenbergs, "Edge pre-processing of traffic surveillance video for bandwidth and privacy optimization in smart cities," in 2020 17th Biennial Baltic Electronics Conference (BEC), pp. 1–6, IEEE, 2020.

3. R. Rava, M. Ivanovs, A. Skadins, and K. Nesenbergs, "World coordinate virtual traffic cameras: Edge-based transformation and merging of multiple surveillance video sources," in 2020 7th International Conference on Soft Computing & Machine Intelligence (ISCMI), pp. 233–236, IEEE, 2020.

4. M. Ivanovs, R.
Kadikis, and K. Ozols, “Perturbation-based methods for explaining deep neural networks: A survey,” Pattern Recognition Letters, vol. 150, pp. 228–234, 2021.

5. M. Ivanovs, B. Banga, V. Abolins, and K. Nesenbergs, “Methods for explaining CNN-based BCI: A survey of recent applications,” in 2022 IEEE 16th International Scientific Conference on Informatics (Informatics), pp. 137–141, IEEE, 2022.

6. B. Cugmas, E. Štruc, I. Berzina, M. Tamosiunas, L. Goldberga, T. Olivry, K. Zviedris, R. Kadikis, M. Ivanovs, M. Bürmen, and P. Naglič, “Automated classification of pollens relevant to veterinary medicine,” in 2024 IEEE 14th International Conference Nanomaterials: Applications & Properties (NAP), pp. 1–4, IEEE, 2024.

7. B. Cugmas, E. Štruc, M. Tamosiunas, L. Goldberga, I. Berzina, R. Kadikis, M. Ivanovs, S. Warshaneyan, and P. Naglič, “Comparison of two fixation methods in automated pollen classification on whole slide images,” in Latin America Optics and Photonics Conference, Optica Publishing Group, 2024.

I have presented the results of these studies at the following conferences:

1. Latvijas Ārstu kongress: Zinātniski praktiskā sesija “Lielie dati medicīnā” (Latvian Medical Congress: scientific-practical session “Big Data in Medicine”), Riga, Latvia, 2022. Presentation: Mākslīgais intelekts un attēlu apstrāde medicīnas pielietojumiem (Artificial Intelligence and Image Processing for Medical Applications).

2. IEEE 16th International Scientific Conference on Informatics – Informatics 2022, Poprad, Slovakia, 2022. Presentation: Methods for Explaining CNN-Based BCI: A Survey of Recent Applications.

I conducted and approbated my research for this thesis at the Institute of Electronics and Computer Science (EDI – Elektronikas un datorzinātņu institūts). The research was part of several scientific projects at EDI and was financially supported by their funding. The following is the list of these projects:

1. Programmable Systems for Intelligence in Automobiles – PRYSTINE (Horizon 2020 ECSEL Joint Undertaking funding under grant agreement 783190).

2.
Efficient module for automatic detection of people and vehicles using video surveillance cameras – VAPI (ERDF project No. 1.2.1.1/18/A/006 research No. 1.5).

3. Automated hand washing quality control and hand washing quality evaluation system with real-time feedback – Handwash (project No. lzp-2020/2-0309).

4. Integration of reliable technologies for protection against Covid-19 in healthcare and high risk areas – COV-CLEAN (project No. VPP-COVID-2020/1-004).

5. Intelligent Motion Control under Industry 4.E – IMOCO4.E (Horizon 2020 ECSEL Joint Undertaking funding under grant agreement 101007311).

6. AI-Improved Organ on Chip Cultivation for Personalised Medicine – AImOOC (contract with Central Finance and Contracting Agency of Republic of Latvia no. 1.1.1.1/21/A/079; the project was cofinanced by REACT-EU funding for mitigating the consequences of the pandemic crisis).

7. Holographic microscopy- and artificial intelligence-based digital pathology for the next generation of cytology in veterinary medicine – VetCyto (project No. lzp-2023/1-0220).

The thesis consists of five chapters, followed by a conclusion and bibliography. In Chapter 1, I provide the background of my work, discussing the fields of computer vision and deep learning and their intersection. Chapter 2 focuses on the use of CNNs for hand-washing movement classification; I present the datasets collected in the course of research as well as the results of training and evaluating CNN models on them. Chapter 3 addresses the use of CNNs for semantic segmentation of urban street views. In Chapter 4, I detail the use of CNN-based object detectors for a bin-picking task carried out by a robotic arm. In Chapter 5, I report on the application of CNNs to classifying cell culture images with the goal of automating the process of growing organs-on-a-chip.
Finally, in the Conclusion, I revisit the key findings of the research presented in this thesis, reflect on the challenges encountered throughout the work, substantiate the four thesis statements I propose for defence, and suggest directions for future research aimed at advancing computer vision methods and overcoming the current limitations in image understanding tasks.

Chapter 1

Background

The goal of this chapter is to outline the relevant background for research on image understanding within the context of applied computer vision. The structure of the chapter is as follows. In Section 1.1, I define computer vision – first descriptively and informally, then more rigorously and formally – and discuss its scope as well as explore some of its highlights. In Section 1.2, I narrow the focus of the discussion to a subfield of computer vision – image understanding – and discuss three major tasks in that subfield that are particularly relevant to the work presented in this thesis: image classification, image segmentation, and object detection. Furthermore, I present the key metrics for evaluating the performance of algorithms and systems on these tasks. In Section 1.3, I examine methods for solving image understanding tasks. I briefly introduce classical methods, but primarily focus on deep learning-based approaches, as these are central to the research in this thesis. In Section 1.4, I explore the importance of datasets for training machine learning (ML) models and the challenges related to their availability. I also discuss how transfer learning, data augmentation, and synthetic data can help mitigate the problem of the scarcity of training data. Finally, in Section 1.5, I offer some concluding remarks.
1.1 Computer vision: definition, scope, and highlights

Providing a comprehensive overview of a field as vast, complex, multifaceted, and rapidly evolving as computer vision is a truly daunting task; attempting such an overview in a single chapter of a PhD thesis would be overly ambitious and likely unfeasible. Therefore, instead of trying to survey the field of computer vision in its entirety, I limit myself to the more modest task of defining what computer vision is and highlighting the aspects most relevant to this thesis. For a more detailed treatment of the subject, I refer readers to the fundamental works by Jähne et al. [20], Szeliski [5], Hartley and Zisserman [21], and Forsyth and Ponce [22] as well as more recent, though less comprehensive, surveys of the state of the art in computer vision by O'Mahony et al. [15] and Feng et al. [23]. Regardless of the scope of the discussion, a rigorous approach dictates that I must define the term 'computer vision' in a manner that is both comprehensive and formal. However, to motivate my choice of the definition, I first consider the notion of computer vision more informally. A suitable starting point is Minsky's famous suggestion¹ for a research project that I already mentioned in the Introduction to this thesis: to link a camera to a computer and get the computer to describe what it is seeing [6]. Indeed, it appears that this is largely what researchers and software engineers in computer vision do: with the help of computers, they transform input from visual sensors into some meaningful representation of the original visual scene.

¹ As a caveat, one should not oversimplify Minsky's proposal, which was in fact quite elaborate – see the memo of the 'Summer Vision Project' drafted by Seymour Papert [24]. All in all, I mention it here as a convenient way to initiate the discussion rather than to belittle Minsky's ambition.
While such a setup is ubiquitous, it is far from trivial, as there are many aspects to consider, such as:

• What kind of input does the system receive? As human thinking tends to be anthropocentric, the first association when considering input to a camera linked to a computer might be that it captures a typical visual environment, that is, an RGB video stream from the part of the electromagnetic spectrum visible to humans, ranging from ≈400 nm to ≈700 nm. However, it does not seem reasonable to exclude other segments of the electromagnetic spectrum, such as ultraviolet (UV) light, infrared light, or X-rays, from the scope of computer vision. While these wavelengths are not visible to the human eye per se, the same methods as the ones used for processing and analysing visible light can often be applied to them.

• Is the input treated as a continuous stream or as divided into discrete units? While humans perceive their visual environment as a continuous flow of information, the temporal dimension of such input makes it more challenging to process and analyse in computer vision. Circumventing this challenge, much of computer vision research focuses on images rather than video: there are more image datasets than video datasets for training and benchmarking models, and image generation, processing, and analysis methods are more developed. While the focus on images is practical for these reasons, it makes computer vision markedly different from biological vision, where treating static snapshots of visual scenes as the primary units of analysis may seem artificial.

• What kind of output do we expect from the system? The output of the human visual system is, essentially, seeing the world, and understanding what 'seeing' actually means has kept philosophers and cognitive scientists occupied for a long time. In contrast, the typical output of computer vision systems has historically often been discrete and rather simple, such as matching an entire image to a single label.
As with the previous point, this approach has been customary due to practical considerations, as simple output makes it easier to design, benchmark, and analyse computer vision systems. However, it also makes these systems markedly different from their biological counterparts. Conversely, developing systems that produce more complex output, such as elaborate descriptions of visual scenes, brings computer vision closer to other fields of AI such as NLP.

• How do we evaluate the quality of the output? Simple evaluation metrics (e.g., the number of correctly classified images) are easier to implement, yet they can oversimplify tasks compared to the complexity of real-world visual environments.

• Where (if at all) do we draw the line between vision and cognition? This question arguably arises whenever understanding is required in a computer vision task. Furthermore, similar to biological visual systems, which incorporate acquired experience into their function rather than operate as blank slates, their digital counterparts also integrate world knowledge, and as a consequence, vision and cognition become intrinsically intertwined.

• To what extent are the inner workings of the system comprehensible and transparent? The human visual system is highly efficient but notoriously complex, and only partly understood. While some of the difficulty in studying it arises from the impossibility of conducting invasive experiments on humans, it is also true that even when such experiments are possible, e.g., in the case of laboratory animals, scientific endeavours have so far provided only partial understanding of how vision functions, and the brain, to a large extent, still remains a 'black box'. In the case of computer vision systems, the best results on many computer vision tasks have been achieved with deep neural networks, which in their essence are also 'black boxes': while it is easier to pry them open (e.g.
to extract the state of a particular artificial neuron or its response to a stimulus) than their biological counterparts, analysing these systems is still challenging due to the large number of parameters in a typical DNN model. As a result, computer vision systems often match their biological counterparts in terms of their opacity.

The questions posed above are best treated not as requiring immediate and straightforward answers, but rather as signposts guiding research paths in computer vision. At the same time, they lead me to conclude the following:

• Computer vision deals not only with input in the visible light spectrum, but also with wavelengths invisible to humans, such as ultraviolet light, infrared light, and X-rays.

• The three main components of a computer vision system are input (including input formation), information processing algorithms, and output. These three components are interrelated: thus, as Jähne et al. [25] observe, in their work, they regard computer vision 'from image formation to measuring, recognition, or reacting' as an integral process. Overall, this is a reasonable approach: for instance, the type of input data may affect the choice of information processing algorithms, and the desired output format may influence the choice of the visual sensors for acquiring input. However, I would also like to note that in practice, it is quite common in computer vision to focus on particular stage(s) of the said integral process while taking other stages for granted. For instance, a researcher aiming to improve the performance of a particular algorithm on a specific dataset is unlikely to spend much (if any) time pondering the physics of image formation, as it bears no particular relevance to their immediate task.
Such a reductionist approach certainly has its advantages, but it may also result in oversights when some aspects of computer vision are actually important but are not accounted for in a particular study or setup.

• The interrelation of different aspects of computer vision makes it a multidisciplinary research field that brings together physics, mathematics, computer science, biology, physiology, cognitive sciences, and other disciplines.

• The boundaries between vision and other perceptual and cognitive processes, such as language processing, are not clear-cut but rather blurry. This is because computer vision tasks often involve at least some level of understanding of the visual world, and the output of a computer vision system is frequently expressed in the form of a verbal description. Therefore, the said boundaries should be drawn cautiously, if at all, depending on the specific research problem.

• While evaluation metrics are necessary and useful for comparing and improving computer vision algorithms, they should be taken critically², as they tend to oversimplify the challenges compared to those faced by visual systems operating in real-world environments.

² Cf. Goodhart's Law, which states that a measure ceases to be useful when it becomes a target itself.

Although the above observations do not provide a comprehensive analysis of computer vision but rather highlight some of its noteworthy characteristics, I believe that they convincingly demonstrate the complexity and scope of computer vision as a field of research and engineering, and, by extension, show that producing a rigorous and sufficiently informative academic definition of computer vision is not straightforward. To demonstrate the obstacles that one may encounter in a quest for such a definition, let us consider how computer vision is defined in a reputable source for the general audience, such as The Encyclopædia Britannica.
According to Britannica, it is '[a] field of artificial intelligence in which programs attempt to identify objects represented in digitised images provided by cameras, thus enabling computers to "see"' [26]. While this definition is essentially correct, it includes several contentious points, namely:

• Computer vision is treated as a subfield of artificial intelligence. While most modern computer vision methods are indeed rooted in AI, there are also algorithms – for instance, for optical flow estimation [27, 28], colour-based segmentation [29, 30], and edge detection [31] – that are based on explicit programming and predefined mathematical (especially geometrical) principles.

• Objects are mentioned as the only type of entities that computer vision is concerned with. To provide just a single yet convincing counterexample, segmentation algorithms aim to identify not only objects but also various background elements (e.g., sky, terrain, water surface, or wall) in images.

• Images are considered the primary data source for computer vision systems. However, as previously mentioned, images are artificial entities, and working with them differs from processing the continuous visual stream characteristic of the real world. Moreover, an excessive focus on images as the primary units of analysis may, in my opinion, hinder the overarching goal of computer vision to enable computers to truly see.

In my search for a better and more precise definition of computer vision, I have found and henceforth adopt the one given in the seminal work by Jähne et al. [25], which defines computer vision as 'the host of techniques to acquire, process, analyze, and understand complex higher-dimensional [visual] data from our environment'.
Note that I have added the term '[visual]' in square brackets to clarify the meaning, as the original definition is given in a computer vision textbook, where the context makes the phrase 'complex higher-dimensional data' clear, whereas for standalone use, including 'visual' may be helpful.

As a final note on the definition and scope of computer vision, I would like to differentiate it from digital image processing. While some (admittedly, rather few) works, e.g., O'Mahony et al. [15], treat these terms as equivalent, I follow the more prevalent approach in this thesis by distinguishing them. Specifically, I adopt the approach of Pitas et al. [32] and maintain that digital image processing refers to low-level operations on images such as image enhancement (e.g., adjusting brightness or contrast) or colour image processing. In contrast, the broader term 'computer vision' (and the similar but less frequently used 'digital image analysis') also includes higher-level operations on visual data such as object detection and image segmentation. Depending on usage, computer vision may subsume digital image processing, but not the other way around, as the latter is substantially narrower in scope.

1.2 Image understanding: definition, main tasks, metrics

As established in the previous section, computer vision spans multiple disciplines, from physics to psychology to philosophy, and is concerned with a wide range of research problems. I also argued that it is common practice in the field of computer vision to focus on some of its aspects rather than trying to encompass the full breadth of research questions in it. As this thesis also follows the common practice of narrowing the scope of research, I will introduce a narrower term within the domain of computer vision to refer to the research problems that my work is primarily concerned with.
For that purpose, I henceforth use the term 'image understanding', which has been used in a number of well-regarded studies, e.g. [32], [33], [34], [35], though it is notably less prevalent in the literature than the term 'computer vision' (cf. Figure 1.1).

Figure 1.1: Annual trends in Scopus-indexed publications on computer vision and image understanding (1955–2023). The graph shows the number of publications per year, filtered by the search terms computer AND vision and image AND understanding within the 'Computer Science' and 'Engineering' subject areas. The data were retrieved on 20 January 2024.

I use the term 'image understanding' as defined by Zhang [36], that is, as a suite of methods and techniques that 'attempt to interpret the meaning of image at a high level to provide semantic information closely related to human thinking, and help further to make decisions and to guide the actions according to the understanding of scenes.' Some of the primary tasks within image understanding are image classification, image segmentation, object detection, object localisation, image captioning, optical character recognition, and pose estimation. My research that forms the foundation of the present thesis is concerned with the first three of these tasks, that is, image classification, image segmentation, and object detection. Therefore, I dedicate the remainder of this section to defining and briefly discussing these tasks. However, prior to that, I will address potential objections to my choice of the term 'image understanding' to refer to them.

First, I address the possible question of why I use the term 'image understanding' instead of directly referring to image classification, image segmentation, and object detection. The practical reason is that 'image understanding' is more concise and therefore more convenient, avoiding the need to list all three tasks each time I refer to them collectively.
Another, more general reason is my belief that research on image classification, image segmentation, object detection, and other similar tasks should not only aim at improving the performance of computer systems on these specific tasks, but also contribute to the overarching goal of enabling computers to understand images (and visual data in general) at a level comparable to human understanding of our visual environment. While the contributions of my work in this thesis to the advancement of this goal are incremental rather than fundamental, and while the convolutional neural networks that I employed for solving practical computer vision tasks are far from actually understanding their input in the way humans do, I believe that it is still valuable to use the term 'image understanding' so that, figuratively speaking, it would remind us of the forest looming behind the trees.

Second, I address the potential question of why I prefer the term 'image understanding' over another umbrella term for the three tasks in question, namely, 'classification tasks'. Indeed, the latter term is sometimes used to jointly refer to image classification, image segmentation, and object detection. For image classification, this is straightforward, as the task inherently involves classification; image segmentation can be viewed as a classification task at the pixel level; finally, object detection is often carried out using a two-stage process: first, object localisation, followed by object classification. However, there are several reasons why I find the term 'classification tasks' less appropriate in the given context. To begin with, this term can easily be misunderstood as referring solely to image classification – just one of the three tasks it aims to encompass. Furthermore, the term 'classification tasks' does not emphasise the ultimate goal of understanding the semantics of the visual world as effectively as the term 'image understanding' does.
Finally, while many state-of-the-art object detection methods consist of two stages and do involve classification in the second stage, not all object detection methods rely on classification – for instance, objects can be detected by detecting changes (cf. e.g. [37]). Therefore, categorising all object detection methods as a subtype of classification task would be an overgeneralisation. As a consequence, using the term 'classification tasks' for a group of tasks that includes object detection would arguably be rather imprecise.

Having defined image understanding and justified my use of the term, I now turn to the three image understanding tasks central to this thesis: image classification, image segmentation, and object detection (see examples in Figure 1.2). In addition, I will introduce the common evaluation metrics for these tasks.

Figure 1.2: Examples of image classification, object detection, instance segmentation, and semantic segmentation. Reproduced from [38].

Image classification, in its essence, is a labelling procedure [39]: a classifier labels a given image I as belonging to a single class Ci, which is an element of a fixed set of considered classes C = {C1, C2, C3, ..., CN}. Several variations can arise from this general definition. Thus, if |C| = 2, the classification is said to be binary; to distinguish the situation with |C| > 2 from binary classification, it is commonly called multiclass classification. Furthermore, in some situations, it is more appropriate to assign two or more labels to an image rather than just a single label, as the image may feature more than one object; in such a case, classification becomes multilabel. As Jähne et al. [25] note, other variations in the general labelling scheme can be implemented depending on the task. For instance, we might assign a single label 'cat' to all images of cats, or classify them by breeds.
Similarly, objects of the same shape but different colours might be classified as belonging to either a single class or different classes, depending on the specific criteria.

Image classification is often considered the most basic task in the field of image understanding – for instance, it is often said that the 'Hello World' task of deep learning for computer vision is training a neural network to classify hand-written digits in the MNIST dataset [40]. However, as Rawat and Wang [41] point out, image classification still poses a number of challenges for automated systems; examples of such challenges include variability in the appearance of objects depending on the viewpoint (e.g., the rear of a car looks completely different from its front) and the high variability of objects within the same class (e.g., within the broad class of 'plants') [42]. Overcoming these challenges and limitations is crucial for advancing image understanding, since image classification serves as the foundation for several other tasks, particularly semantic segmentation and object detection.

The main metrics for image classification are accuracy, precision, recall, and F1 score. Among those, accuracy is the most straightforward and commonly used metric. Given that true positives are denoted as TP, true negatives as TN, false positives as FP, and false negatives as FN,

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \quad (1.1)
\]

Precision quantifies the ability of the classifier to identify true positives while producing a low rate of false positives. The formula for calculating precision is as follows:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}. \quad (1.2)
\]

Recall (also called sensitivity), which is often reported alongside precision, quantifies the ability of the classifier to identify true positives while producing a low rate of false negatives. Therefore, it may be interpreted as a measure of the completeness of a classifier. The formula for calculating recall is as follows:

\[
\mathrm{Recall} = \frac{TP}{TP + FN}.
\]
(1.3)

Finally, F1 score, a weighted average of precision and recall, is calculated as

\[
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}. \quad (1.4)
\]

F1 score is often more informative and less misleading than accuracy, particularly in the case of datasets with imbalanced classes. This is because it ensures that both the precision and recall of the model are accounted for. In contrast, accuracy can indeed be deceptive in such cases: for instance, a classifier may achieve an ostensibly impressive accuracy of 90% on a dataset where 90% of the data belongs to the largest class, simply by labelling every test input as belonging to that class. However, the outcome of such 'learning' can hardly be considered a success, and the resulting 'naive' classifier³ is essentially useless. That said, the F1 score also has the drawback of being less intuitive than accuracy.

Object detection has a wide range of real-world applications, including autonomous driving, intelligent video surveillance, robotics, and security [43]. It is a more complex task than image classification, because an object detector needs to both assign each object in the image a label Ci from the fixed set of considered classes C = {C1, C2, C3, ..., CN} and identify the positions of these objects. In other words, while an image classifier aims to answer the question What object is there?, an object detector, as Zou et al. [44] put it, needs to answer the question What objects are where? Localisation of objects is typically done by drawing a bounding box around each of them. A bounding box is a rectangular⁴ frame that fully encloses the object. As Padilla et al. [46] point out, the most common way to represent a bounding box is by providing its top left and bottom right coordinates, i.e., (xinit, yinit, xend, yend). However, they also note that one of the most popular families of object detection algorithms, YOLO detectors [47], uses a different representation.
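To make the two bounding-box conventions mentioned here concrete, the following is a minimal sketch (my own illustrative code, not taken from the thesis; function and variable names are my own) that converts a corner-format box into a normalised centre-based format of the kind YOLO detectors use:

```python
# Illustrative sketch: converting a bounding box given by its top-left and
# bottom-right corners (x_init, y_init, x_end, y_end) into a normalised
# centre-based representation (centre coordinates and box size, each divided
# by the image dimensions). Names are my own, chosen for clarity.

def corners_to_normalised_centre(x_init, y_init, x_end, y_end, img_w, img_h):
    box_w = x_end - x_init
    box_h = y_end - y_init
    x_center = x_init + box_w / 2
    y_center = y_init + box_h / 2
    return (x_center / img_w, y_center / img_h, box_w / img_w, box_h / img_h)

# A 100x50 box with its top-left corner at (200, 150) in a 640x480 image;
# all four returned values lie in [0, 1].
print(corners_to_normalised_centre(200, 150, 300, 200, img_w=640, img_h=480))
```

The normalised form has the practical advantage that the same box description remains valid if the image is resized, which is one reason detectors trained on rescaled inputs favour it.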
In YOLO, a bounding box is defined by providing the x and y coordinates of its centre together with its width and height, each normalised by the image dimensions:

\[
\left( \frac{x_{center}}{\text{image width}},\ \frac{y_{center}}{\text{image height}},\ \frac{\text{box width}}{\text{image width}},\ \frac{\text{box height}}{\text{image height}} \right) \text{ [46]}.
\]

According to a comprehensive survey of object detection metrics by Padilla et al. [48], the primary metrics for this task include precision, recall, and several metrics derived from them, such as average precision, mean average precision, and average recall. All of these metrics are generally based on measuring how closely predicted bounding boxes Bp correspond to ground truth bounding boxes Bgt. The overlap between the former and the latter is calculated using the intersection over union (IoU) measurement:

\[
\mathrm{IoU} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} \quad (1.5)
\]

The visual representation of calculating IoU is shown in Figure 1.3 (a). Figure 1.3 (b) illustrates different IoU values, ranging from no overlap at all with IoU = 0 (left) to a perfect overlap with IoU = 1 (right). For the calculation of precision and recall in the context of object detection, the same formulas are used as for image classification, i.e., the formulas given by Equations 1.2 and 1.3, respectively. However, the key difference is that in the case of object detection, a detected object is treated as a true positive or a false positive depending on whether it meets a predefined IoU threshold. For instance, if the IoU threshold is set to 0.5, detected objects with IoU ≥ 0.5 are counted as true positives, those with lower IoU values are treated as false positives,

³ Here and in the following, a 'naive' classifier refers to a model that operates in a primitive fashion, either assigning classes at random or, in the case of imbalanced datasets, always assigning the majority class. This should not be confused with the naive Bayes classifier – a well-established machine learning method.

⁴ Polygons are also used in some cases (see e.g.
[45]), as they may capture the shapes of real objects better than rectangles.

Figure 1.3: Illustration of the intersection over union (IoU): (a) visualisation of the IoU calculation; (b) visual examples of different IoU values. Adapted from [49].

and any undetected objects are counted as false negatives. The average precision (AP) and average recall (AR) for class Ci in a dataset consisting of n images are calculated as follows:

\[
AP_{C_i} = \frac{1}{n} \sum_{j=1}^{n} P_{C_i j} \quad (1.6)
\]

\[
AR_{C_i} = \frac{1}{n} \sum_{j=1}^{n} R_{C_i j}. \quad (1.7)
\]

Since datasets typically contain multiple classes C1, C2, C3, ..., CN, the average precision values for each class can be summarised by calculating the mean Average Precision (mAP) across all classes:

\[
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_{C_i}. \quad (1.8)
\]

While the mAP is often calculated with a fixed IoU threshold set to 0.5, another common approach is to compute several mAP values with IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05, and then average these values. This is expressed as:

\[
\mathrm{mAP}_{avg} = \frac{1}{|T|} \sum_{t \in T} \frac{1}{N} \sum_{i=1}^{N} AP^{t}_{C_i} \quad (1.9)
\]

where |T| is the number of IoU thresholds T = {0.5, 0.55, 0.6, ..., 0.95}, N is the number of classes, and AP^t_{C_i} is the average precision for class Ci at IoU threshold t. For clarity, the resulting value is often denoted as mAP_{avg} for IoU ∈ [0.5 : 0.05 : 0.95].

Image segmentation has applications across various domains, from medical image analysis to self-driving cars. The goal of segmentation is to partition an input image into multiple segments (i.e., continuous groups of pixels), sometimes referred to as 'superpixels'. Naturally, the partitioning should be meaningful in some sense rather than random. Thus, one of the main types of image segmentation, semantic segmentation, has the goal of labelling each pixel of the resulting segments with a single class label Ci belonging to a fixed set of considered classes {C1, C2, C3, ..., CN}.
Semantic segmentation is thus essentially multi-class classification at the pixel level; as it is performed for each pixel of an input image, it is generally considered a more challenging task than image classification [50]. An even more challenging variant is instance segmentation, which distinguishes not only between different classes but also between different instances of the same class. For instance, for the example provided in Figure 1.2, a semantic segmentation algorithm will assign the label dog to some pixel (Figure 1.2, D), but for instance segmentation, more detailed labelling is needed to indicate whether the algorithm considers the pixel in question as belonging to dog1 or dog2 (Figure 1.2, C).

The most common metrics for evaluating image segmentation are pixel accuracy, mean pixel accuracy, IoU, and mIoU. Pixel accuracy (PA) is the ratio of correctly classified pixels to the total number of pixels. For N classes, it is calculated as follows:

\[
PA = \frac{\sum_{i=1}^{N} p_{ii}}{\sum_{i=1}^{N} \sum_{j=1}^{N} p_{ij}}, \tag{1.10}
\]

where \(p_{ij}\) is the number of pixels of class \(i\) predicted as belonging to class \(j\) (where \(i = j\) or \(i \neq j\)). Mean pixel accuracy (mPA) is the average PA computed across all classes in the dataset, that is,

\[
mPA = \frac{1}{N} \sum_{i=1}^{N} \frac{p_{ii}}{\sum_{j=1}^{N} p_{ij}}. \tag{1.11}
\]

While both PA and mPA are simple and intuitive metrics, they can misrepresent the performance of a segmentation algorithm when dealing with datasets containing imbalanced classes. In particular, an algorithm may achieve high values for these metrics by accurately predicting the most common class while performing poorly on less represented classes. Furthermore, neither PA nor mPA takes into consideration the localisation of the predicted pixels, making these metrics less robust. Therefore, they appear less frequently in the literature than IoU and mIoU, which are the leading metrics for evaluating semantic segmentation algorithms.
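These pixel-level metrics are conveniently computed from an N × N confusion matrix whose entry p[i][j] counts the pixels of class i predicted as class j. A minimal pure-Python sketch (the function names are mine):

```python
def pixel_accuracy(p):
    """Equation 1.10: correctly classified pixels over all pixels."""
    n = len(p)
    correct = sum(p[i][i] for i in range(n))
    total = sum(sum(row) for row in p)
    return correct / total

def mean_pixel_accuracy(p):
    """Equation 1.11: per-class pixel accuracy averaged over classes."""
    n = len(p)
    return sum(p[i][i] / sum(p[i]) for i in range(n)) / n

def mean_iou(p):
    """Per-class IoU (tp / (tp + fn + fp)) averaged over classes."""
    n = len(p)
    ious = []
    for i in range(n):
        tp = p[i][i]
        fn = sum(p[i]) - tp                        # class-i pixels predicted as other classes
        fp = sum(p[j][i] for j in range(n)) - tp   # other classes' pixels predicted as i
        ious.append(tp / (tp + fn + fp))
    return sum(ious) / n

# Two classes: 80 of 100 background pixels and 10 of 20 foreground pixels correct.
conf = [[80, 20],
        [10, 10]]
print(pixel_accuracy(conf))       # 90/120 = 0.75
print(mean_pixel_accuracy(conf))  # (0.8 + 0.5)/2 = 0.65
print(mean_iou(conf))
```

The example also illustrates the class-imbalance effect noted above: the overall PA of 0.75 is flattering, while the rarer foreground class pulls both mPA and mIoU down.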
IoU for segmentation is calculated similarly to IoU for object detection, with the key difference being that, in segmentation, the intersection between the ground truth and the prediction is computed for segmentation masks rather than bounding boxes. The formula for IoU for semantic segmentation with N classes is:

\[
IoU = \frac{\sum_{i=1}^{N} p_{ii}}{\sum_{i=1}^{N} \sum_{j=1}^{N} p_{ij} + \sum_{i=1}^{N} \sum_{j=1}^{N} p_{ji} - \sum_{i=1}^{N} p_{ii}} \text{ [51].} \tag{1.12}
\]

mIoU, which averages the IoU across all classes in the dataset, is calculated as follows:

\[
mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{p_{ii}}{\sum_{j=1}^{N} p_{ij} + \sum_{j=1}^{N} p_{ji} - p_{ii}} \text{ [51],} \tag{1.13}
\]

where, as mentioned above, \(p_{ij}\) is the number of pixels of class \(i\) predicted as belonging to class \(j\) (where \(i = j\) or \(i \neq j\)).

1.3 Methods for solving image understanding tasks

Having outlined the main image understanding tasks and the metrics for evaluating the performance of computer systems on these tasks, I proceed with a brief overview of the methods for solving them. Broadly speaking, these methods can be divided into two principal categories: classical methods and deep learning-based methods.

1.3.1 Classical methods

The defining characteristic of classical methods for image understanding is that they are fully or partially based on explicitly programmed algorithms and rely on hand-crafted features and rules designed by human experts in specific domains (see Figure 1.4 (a)).

Figure 1.4: Comparison between the workflows for model design: (a) the workflow of classical computer vision methods; (b) the deep learning workflow. Adapted from [52].

As Szeliski [5] points out, some examples of classical methods for image classification are bag-of-words algorithms and parts-based algorithms. Bag-of-words (also known as bag-of-features, bag-of-keypoints, and bag-of-keypatches [53, 54]) algorithms in computer vision were originally inspired by the bag-of-words approaches to text classification in NLP. In image classification, the ‘words’ in question are visual words, i.e., hand-crafted visual features.
Bag-of-words algorithms classify images by computing the distribution of visual words in the target image and comparing it to the distribution that the algorithm learned from the training data, using a classifier such as k-Nearest Neighbours (kNN) or Support Vector Machine (SVM) [5] (see Figure 1.5).

Figure 1.5: Example of a bag-of-words algorithm pipeline. Reproduced from [54].

While bag-of-words algorithms are conceptually simple and intuitive, they suffer from two major shortcomings: first, the visual words are not particularly semantically meaningful; second, these algorithms do not take into account the spatial location of the visual words in the image. To address these limitations and mitigate their effect on classifier accuracy, parts-based algorithms emerged. These algorithms differ from bag-of-words approaches by relying on more semantically meaningful constituent parts of the classified objects (e.g., the wheels of a motorcycle) and making use of the geometric relationships between these parts. Different parts-based methods vary substantially in how they model these geometric relationships – for instance, as constellations, stars, trees, hierarchical models, or sparse flexible models.

In the early stages of object detection research, the primary approach was the sliding window method, which involves dividing the target image into subwindows and applying a classifier to each of them to determine whether it contains an instance of the object (see Figure 1.6).

Figure 1.6: Example of a sliding window pipeline for object detection. Reproduced from [55].

To implement this approach successfully, several challenges need to be addressed: an algorithm has to classify the object(s) in each subwindow while also handling variations in object size and overlap between image windows [56].
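The sliding-window pipeline just described can be sketched in a few lines; `classify` below stands in for any window-level classifier (e.g., an SVM over hand-crafted features) and is a hypothetical placeholder:

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield the top-left corners of all win_w x win_h subwindows."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y

def detect(image, classify, win_w=64, win_h=64, stride=16):
    """Run the window classifier over the whole image; return positive windows."""
    img_h, img_w = len(image), len(image[0])
    hits = []
    for x, y in sliding_windows(img_w, img_h, win_w, win_h, stride):
        window = [row[x:x + win_w] for row in image[y:y + win_h]]
        if classify(window):
            hits.append((x, y, win_w, win_h))
    return hits
```

Handling the challenges mentioned above would require, in addition, repeating this scan over a resized image pyramid (object size variation) and merging overlapping hits (e.g., by non-maximum suppression).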
As Szeliski [5] points out, the development of classical methods for general object detection was largely driven by attempts to solve the PASCAL Visual Object Classes (VOC) challenge [57]. An early approach that achieved relative success on the PASCAL VOC dataset involved running an SVM classifier on features extracted from subwindows obtained with branch-and-bound search [58] or selective search based on hierarchical segmentation [59]. However, while classical methods made notable progress in detecting specific types of objects such as faces [60] and pedestrians [61], the task of detecting general object categories was not only more challenging than image classification but was also considered nearly intractable with classical methods.

Due to its complexity, semantic segmentation poses an even greater challenge for classical methods than image classification or object detection. To tackle it, early studies often employed bottom-up approaches, relying on hand-crafted low-level image features such as the smoothness and continuity of image region boundaries [62], texture [63], or colour [64]. An alternative is a top-down approach: for instance, Borenstein and Ullman [65] proposed matching putative segments in target images with stored representations of the shapes of objects from a particular class. Furthermore, since semantic segmentation is essentially classification at the pixel level, the concept of visual words – which, as previously mentioned, was first applied to image classification in the domain of computer vision – found its use for semantic segmentation as well. Thus, in Schroff et al. [66], each class was modelled with a single histogram of visual words; during classification, a label was assigned to each pixel by generating visual words in its neighbouring region, calculating a histogram for them, and then finding the class histogram with the shortest Euclidean distance to that histogram.
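The single-histogram scheme of Schroff et al. [66], as I read it, reduces to a nearest-histogram lookup per pixel. The histograms, class names, and function names below are illustrative values of my own, not data from the cited study:

```python
import math

def euclidean(h1, h2):
    """Euclidean distance between two histograms of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def label_pixel(neigh_hist, class_hists):
    """Assign the class whose stored histogram is closest to the
    histogram of visual words computed around the pixel."""
    return min(class_hists, key=lambda c: euclidean(neigh_hist, class_hists[c]))

# Hypothetical class models over a 3-word vocabulary:
class_hists = {"grass": [0.7, 0.2, 0.1], "sky": [0.1, 0.1, 0.8]}
print(label_pixel([0.6, 0.3, 0.1], class_hists))  # "grass"
```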
However, as Yu et al. [67] note, while the method of single-histogram class models is easy to implement, its disadvantage is that it treats each pixel individually, whereas the very concept of dividing an image into segments or superpixels implies that the pixels are associated with each other. Another pixel-wise method, TextonBoost [68, 69], sought to better capture local information by employing a Conditional Random Field (CRF, [70]). The CRF took extracted image features such as texture, layout, colour distribution, location, and the edges of putative objects as its input; the use of these features allowed the CRF to model contextual relationships between pixels. However, as Sturgess et al. [71] observe, the rough shape and texture model of TextonBoost resulted in an imprecise representation of object boundaries. Overall, CRF-based, top-down, and bottom-up classical semantic segmentation methods were not sufficiently accurate or robust for many real-world applications.

1.3.2 Deep learning-based methods

The rise in popularity of deep learning-based methods began in 2012, when the convolutional neural network (CNN; a particular class of DNN optimised for analysing visual imagery; details follow) AlexNet [18] won the ILSVRC2012 [72] image classification competition by a landslide, achieving a Top-5 classification error of 16.4% and thus outperforming the closest competitor by 9.8 percentage points [72]. In just a few years, image classification became dominated by DNN-based methods (cf. Figure 1.7). A similar shift occurred in object detection by 2014, when the Regions with CNN features (R-CNN) [73] model was introduced and convincingly outperformed its closest competitor, the classical Deformable Parts Model (DPM) v5 [74]. On the PASCAL VOC07 dataset, R-CNN achieved a mean average precision of 58.5%, compared to 33.7% for DPM v5.
The same transition occurred in the domain of semantic segmentation: today’s state-of-the-art methods are DNN-based, and as demonstrated by a recent benchmarking study by Plaksyvyi et al. [75], they substantially outperform their classical forerunners.

Figure 1.7: Top-5 error rate of image classifiers that won ILSVRC. The depth of the models is provided as well; note that the decreases in error rate correspond to increases in the depth of the models. Also note that since ResNet [76], the state-of-the-art models have outperformed humans (the last bar) on this challenge. Adapted from [5].

As a result of the expansion of deep learning, DNNs have largely replaced classical methods for the majority of image understanding tasks. As O’Mahony et al. [15] point out in their comprehensive comparison of deep learning with classical ML methods, the latter are still used nowadays when image understanding tasks can be solved by simple means, such as pixel counting or colour thresholding, when algorithms must run on low-power devices incapable of supporting resource-hungry DNNs, or as an auxiliary method alongside DNNs. However, in most other cases, DNNs are preferred over classical approaches. The primary reason is their superior performance, yet several other factors contribute to their popularity. One key advantage is that it is not necessary to hand-craft features for them (cf. Figure 1.4 (b)), as DNNs learn directly from data through end-to-end learning. Additionally, their accuracy scales with the amount of data available for training, and transfer learning (see Section 1.4.2) allows DNNs to apply knowledge learned from large datasets to smaller ones. Finally, the availability of open-source frameworks for deep learning, such as TensorFlow [77], Keras [78], and PyTorch [79], has also facilitated the rise of deep learning to popularity.
As for the drawbacks of deep learning, the most notable are that DNNs require large amounts of data and computing power for training. However, the vast amounts of data available on the Internet and the powerful graphics processing units (GPUs) available on the hardware market help mitigate these challenges. Due to its popularity, deep learning is currently a vast and very rapidly evolving field of research in both academic and industrial domains, with a wealth of literature and Internet sources available. Some of the best surveys of this field have been written by LeCun et al. [80], Pouyanfar et al. [81], and Alom et al. [82]; furthermore, excellent introductions to the field can be found in textbooks by Goodfellow et al. [14], Howard and Gugger [83], Chollet [84], Glassner [85], and Zhang et al. [86], among many others. In the following brief overview of deep learning, I follow the path commonly found in these sources and approach the task of outlining the underlying principles of deep neural networks by starting with the simplest building block of a neural network: the single artificial neuron.

Single neuron model

The development of the artificial neuron model was originally inspired by the structure and functionality of its biological counterpart (cf. Figure 1.8 (a)). The first such model, which implemented simple logic gates, was proposed by Pitts and McCulloch [2]; later, Rosenblatt [87] advanced it by improving the architecture and the learning algorithm of the model, thus making it capable of performing linearly separable classification tasks. The direct descendant of the McCulloch–Pitts neuron and the perceptron, the modern artificial neuron (cf. Figure 1.8 (b)), is described by the formula

\[
y = f\left( \sum_{i=1}^{n} w_i x_i + b \right), \tag{1.14}
\]

where \(x_i\) is the i-th input to the neuron, \(w_i\) is the weight that the i-th input is multiplied by, \(b\) is the bias term, \(f\) is the activation function, and \(y\) is the output of the neuron.
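Equation 1.14 translates directly into code; the weights and inputs below are arbitrary illustration values:

```python
def relu(x):
    """A common choice for the activation function f: max(0, x)."""
    return max(0.0, x)

def neuron(inputs, weights, bias, f=relu):
    """Equation 1.14: y = f(sum_i w_i * x_i + b)."""
    return f(sum(w * x for w, x in zip(weights, inputs)) + bias)

print(neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2))  # ≈ 0.5
```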
Remarkably, the artificial neuron retains some similarities to biological neurons: the inputs \(x_1, x_2, x_3, \dots, x_n\) of the artificial neuron are similar to the signals from other neurons that a biological neuron receives via dendrites; the weights \(w_1, w_2, w_3, \dots, w_n\) that these inputs are multiplied by correspond to the strength of synaptic connectivity; the summation function \(\sum\) of the artificial neuron corresponds to the aggregation of the activations received via dendrites in the body of the neuron; and the activation function \(f\) of the artificial neuron corresponds to the threshold for firing in a biological neuron.

Figure 1.8: Schematic representation of (a) a biological neuron; (b) an artificial neuron. Reproduced from [85] and [88].

However, it should be emphasised that these similarities are observed at a rather high level of abstraction; moreover, artificial neurons significantly diverge from the structure and function of biological neurons to meet their specific functional and computational requirements. In particular, two essential features of the modern artificial neuron, its activation function and optimisation algorithm, are driven by practical considerations, such as the desired output range and learning efficiency, rather than by striving for biological plausibility. The most commonly used activation function in a modern artificial neuron is the Rectified Linear Unit (ReLU) function:

\[
f(x) = \max(0, x). \tag{1.15}
\]

ReLU was originally introduced in the context of artificial neural networks by Fukushima [89] and gained popularity through the work of Nair and Hinton [90]. As Bhumbra [91] observes, while earlier popular activation functions, such as the Heaviside step function (Figure 1.9 (a)) and the sigmoid function (Figure 1.9 (b)), were biologically inspired – the former resembling the ‘all-or-none’ firing property of a biological neuron, and the latter resembling the firing rate of a biological neuron – ReLU (Figure 1.9 (c)) has no known physiological correlates.
Nevertheless, ReLU offers several advantages over both the Heaviside step function and the sigmoid function, especially in neural networks with multiple artificial neurons. In particular, compared to the Heaviside step function,

\[
f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i + b \geq 0 \\ 0 & \text{otherwise,} \end{cases} \tag{1.16}
\]

the advantage of ReLU is that its output is continuous rather than discrete, allowing efficient gradient descent-based optimisation approaches (discussed later) to be leveraged. In comparison with the sigmoid function,

\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \tag{1.17}
\]

ReLU is less computationally expensive and helps mitigate the vanishing gradient problem, which can cause learning to stall in neural networks.

Figure 1.9: Activation functions of artificial neurons: (a) the Heaviside step function; (b) the sigmoid function; (c) the ReLU function.

The foundational algorithm for optimising artificial neurons, gradient descent, updates the weights \(w_1, w_2, w_3, \dots, w_n\)⁵ of a single neuron as follows:

\[
w_{i,new} = w_i - \alpha \cdot \frac{\partial L}{\partial w_i}, \tag{1.18}
\]

where \(w_{i,new}\) is the updated weight \(w_i\) after a single iteration of gradient descent, \(\alpha\) is the learning rate – a small constant (typically in the range from 0.1 to 0.001, depending on such factors as the size of the model and the characteristics of the training dataset) determining the size of the update step – and \(\frac{\partial L}{\partial w_i}\) is the partial derivative of the loss function \(L\) (e.g., binary cross-entropy loss, or mean squared error loss) with respect to the weight \(w_i\). The loss function measures the error between the predicted output of the neuron and the ground truth, i.e., the actual target values. Although gradient descent has no direct neural correlates, it has proven to be an efficient optimisation algorithm for artificial neurons and is widely used today as a constituent part of the backpropagation algorithm [92, 93] to implement learning in artificial neural networks of different scales, from a single neuron to large and complex DNNs.
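For a single linear neuron with a mean squared error loss, the gradient in Equation 1.18 has a simple closed form, so the whole learning loop can be sketched in a few lines. The toy task and hyperparameter values below are illustrative:

```python
# Toy task: learn y = 2*x + 1 with a single linear (identity-activation) neuron.
# The bias is folded in as an extra weight on a constant input of 1.
data = [([x, 1.0], 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]
w = [0.0, 0.0]
alpha = 0.05                       # learning rate

for _ in range(2000):              # iterations of gradient descent
    for inputs, target in data:
        y = sum(wi * xi for wi, xi in zip(w, inputs))
        # For L = (y - target)^2, dL/dw_i = 2*(y - target)*x_i; apply Eq. 1.18:
        w = [wi - alpha * 2.0 * (y - target) * xi for wi, xi in zip(w, inputs)]

print([round(wi, 3) for wi in w])  # converges to ≈ [2.0, 1.0]
```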
⁵For the sake of simplicity, the bias term b is treated as yet another weight for the input x = 1.

Fully connected artificial neural networks

As previously mentioned, an artificial neuron is an essential concept for understanding deep learning. However, a single neuron alone does not have the capacity to solve image understanding tasks⁶ due to their high complexity. The capacity to solve such tasks emerges when artificial neurons are assembled into a network. A simple instance of such a network consists of an input layer, an output layer, and one or more intermediate layers, known as hidden layers. As the information flows in one direction, from input to output, the network is called a feedforward neural network; moreover, as each neuron \(n^{(l)}_1, n^{(l)}_2, n^{(l)}_3, \dots, n^{(l)}_n\) in layer \(l\) is connected both to every neuron \(n^{(l-1)}_1, n^{(l-1)}_2, n^{(l-1)}_3, \dots, n^{(l-1)}_k\) in layer \(l-1\) and to every neuron \(n^{(l+1)}_1, n^{(l+1)}_2, n^{(l+1)}_3, \dots, n^{(l+1)}_m\) in layer \(l+1\) (where \(k, m = n\) or \(k, m \neq n\)), the network is fully connected (Figure 1.10).

Figure 1.10: Schematic representation of the architecture of a fully connected feedforward neural network. Reproduced from [94].

A good example of the capacity of a simple fully connected feedforward network can be found in Nielsen’s now-classic textbook on neural networks [95]: an artificial neural network (ANN) with a single hidden layer of 30 neurons achieves an accuracy of about 95% on the task of classifying hand-written digits in the MNIST dataset [40]. A sceptic might argue that, for a number of reasons, the simplicity of this example ANN is only ostensible, both architecture- and usage-wise. For instance, the number of neurons in the hidden layer has to be determined heuristically, finding the optimal point between underfitting and overfitting. While there are now relatively clear guidelines for choosing an activation function, this decision has historically been more complex.
Additionally, the backpropagation algorithm becomes more intricate in a multi-neuron, multi-layer network, and its learning rate requires adjustment by trial and error. While these arguments are valid⁷, the remarkable simplicity of neural networks lies in the fact that once their building blocks – such as artificial neurons and learning algorithms – are designed, they can be applied to many other tasks without requiring any fundamental changes. Indeed, there is nothing about the design of the above ANN that indicates it is specifically intended for classifying hand-written digits. While the sizes of the input layer, hidden layer, and output layer had to be adjusted to match the input image size, the complexity of the data, and the number of classes in the dataset, there was no need for feature engineering. In other words, it was not necessary to explicitly indicate that the digit zero resembles an oval, or that the digit nine looks like a circle with a squiggle under it, etc.

⁶Apart from some very simple cases, such as pixel intensity-based decision making.
⁷The importance of these architectural and learning considerations is one of the reasons why it took several decades for the ANN and deep learning (DL) paradigms to become practically operational for such complex tasks as image understanding.

As a result, the same architecture can be successfully applied to classify very different images. To exemplify this, I implemented the same architecture as above in the Keras framework [78] and trained it in the Google Colab environment⁸ on an NVIDIA A100 GPU, following the same procedure as in Nielsen [95] – 30 epochs of training, the stochastic gradient descent learning algorithm with α = 3.0, and a mini-batch size of 10 – on a more recent and complex alternative to MNIST, the Fashion MNIST dataset [96] (see the visual comparison between the two in Figure 1.11). To ensure comparability, I also reproduced Nielsen’s MNIST experiment [95] and repeated each experiment 30 times.
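In Keras, assembling such a model takes only a few lines. The sketch below is my illustrative reconstruction of the setup just described, not the exact training script; it assumes TensorFlow’s bundled Keras, and the mean squared error loss mirrors Nielsen’s quadratic cost:

```python
import tensorflow as tf

# 784 inputs (28x28 pixels), one hidden layer of 30 sigmoid neurons, 10 outputs.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=3.0),
              loss="mse")
model.summary()
# Training then follows the procedure described above, e.g.:
# model.fit(x_train, y_train, epochs=30, batch_size=10)
```

Nothing in this definition is specific to digits: swapping in Fashion MNIST requires no change to the model at all.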
As a result, similar to what Nielsen reported, the model in the MNIST experiment achieved a mean accuracy of 94.6% (SD = 0.0036, variance = 1.31e-05) on the test data. As expected, the performance of the non-adapted model on the Fashion MNIST dataset was substantially lower, at a mean accuracy of 84.8% (SD = 0.0085, variance = 7.24e-05) on the test data. However, this is still significantly higher than the accuracy of a putative ‘naive’ classifier (100%/10 classes with equal sample size = 10%) and serves as a fitting example of the generalisability of ANNs to new data.

⁸https://colab.research.google.com/; accessed 10 September 2024.

Figure 1.11: Examples from MNIST (top) and Fashion MNIST (bottom) datasets. Fashion MNIST columns represent the following classes: T-shirt, trousers, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. Reproduced from [97].

Convolutional neural networks

Despite the advantages of fully connected ANNs, they also suffer from certain drawbacks when applied to image understanding tasks. In particular, they do not take into account the hierarchical patterns or spatial relationships in data, as they treat each input feature – such as a pixel value in a grayscale image, or a channel value in an RGB image – independently rather than considering its relation to spatially adjacent features. This drawback may not be particularly evident when fully connected ANNs are trained on simple datasets like MNIST, where the target object occupies the entirety of the image and is centred. However, it negatively affects the performance of such architectures on more complex datasets, such as ImageNet [19], or in real-world applications. Additionally, the dense connectivity in fully connected ANNs is not particularly computationally efficient due to the high number of parameters, which may impact both the training and inference performance of such networks.
Convolutional neural networks (CNNs), the most popular architecture for image understanding, have revolutionised the use of ANNs in computer vision [98] and are also the primary tool used in the research reported in this thesis. CNNs successfully address the challenges of utilising spatial patterns in image data and of making network connectivity more sparse, thereby improving efficiency. Historically, the design of CNNs was inspired by Hubel and Wiesel’s [99] experimental research on the processing of visual spatial information in the cat brain. Fukushima’s Neocognitron [100] was the earliest attempt to integrate the principles of visual shift invariance discovered by Hubel and Wiesel in biological systems into the design of artificial neural networks. Building on this foundational work to develop robust applications, LeCun – often dubbed the father of CNNs – along with his collaborators, improved the methodology of training models invariant to pattern shifts by integrating the backpropagation algorithm into it. This led to state-of-the-art results at the time in image understanding tasks such as document [101] and handwritten zip code [102] recognition. Finally, as previously mentioned, a revolutionary breakthrough in the development of CNNs occurred in 2012, when AlexNet [18] won the ILSVRC competition [72] and sparked an exponential surge in the creation of new CNN architectures and their practical applications. While the range of available CNN architectures is broad, from the early pioneering LeNet [101] to the most recent state-of-the-art models, a typical CNN architecture consists of three main types of layers: convolutional, pooling, and fully connected layers (Figure 1.12).

Figure 1.12: Architecture of a convolutional neural network. Reproduced from [98].

Convolutional layers are the central building blocks of CNNs. Each layer consists of a number of kernels – small matrices with weights that a CNN learns during training.
A kernel operates by sliding across the input that it receives from the previous layer and computing a dot product with the part of the input it covers⁹ (Figure 1.13). As a result, each kernel produces a feature map that indicates the presence of specific features, such as edges or textures, in its input layer (Figure 1.14). Feature maps tend to increase in complexity in deeper layers of CNNs: while activations in the initial layers resemble edge detection, activations in deeper layers respond to more complex patterns. Mathematically, the basic form of a convolution applied to a two-dimensional image can be expressed by the following formula:

\[
S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n) K(i - m, j - n), \tag{1.19}
\]

where \(I\) denotes the image, \(I(m, n)\) is the pixel value of the input image at position \((m, n)\), \(K\) represents the kernel, and \(S(i, j)\) is the output of the convolution operation at position \((i, j)\).

⁹Which, to revisit neurophysiological analogies once again, corresponds to the receptive field of a biological neuron.

Figure 1.13: The basic form of a convolution of a two-dimensional image. The boxes with arrows are drawn to demonstrate how the dot product of the kernel with the upper left element of the input forms the corresponding element of the output. Reproduced from [14].

Figure 1.14: An example of edge detection with a convolutional kernel. Note that, for illustrative purposes, the kernel used in the example is composed of predefined integers. Such an approach is characteristic of traditional image processing, whereas CNNs employ floating-point kernels that learn their weights during training. Reproduced from [103].

Some essential aspects of implementing convolutional layers include the choice of activation function, the number of channels in a kernel, and the selection of padding and stride size. To introduce non-linearity, ReLU or a similar activation function is typically applied to the feature map before it is passed to the next layer.
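The sliding dot product described above can be written out directly for a single-channel input. The sketch below implements the unflipped variant used in practice by deep learning frameworks (a ‘valid’ convolution with stride 1 and no padding); the example image and kernel are my own illustration:

```python
def conv2d(image, kernel):
    """Valid 2D convolution (stride 1, no padding) of a single-channel image."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # Dot product of the kernel with the patch it currently covers:
            s = sum(image[i + m][j + n] * kernel[m][n]
                    for m in range(kh) for n in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge kernel applied to an image with a step from 0 to 1:
img = [[0, 0, 1, 1]] * 3
k = [[-1, 1]] * 3          # responds to left-to-right intensity increases
print(conv2d(img, k))      # [[0, 3, 0]] -- peak at the edge location
```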
Furthermore, since the input to a convolutional layer typically has multiple channels, each kernel must have as many channels as the input. During convolution, each channel is convolved separately, and the resulting values are summed to produce a single output value for each location in the feature map (Figure 1.15). Padding the image with zeros or with the values of neighbouring pixels is often used to ensure that the image size does not shrink after each convolution. While the default stride size for a kernel is 1, larger strides can be used for the sake of dimensionality reduction or to increase the computational efficiency of the network.

Figure 1.15: Application of convolutions to an input that has multiple channels. Note that each kernel must have as many channels as the input, but the output consists of one channel per kernel. Reproduced from [85].

Overall, the use of convolutional layers provides the advantage of sparsity, as the weights in a given convolution kernel are the same for all the input values it is applied to. This results in fewer weights for the network to learn compared to fully connected layers. Additionally, applying the same kernel across the entire input allows the network to detect features corresponding to the given kernel regardless of their location – an important advantage over networks relying solely on fully connected layers. Pooling layers typically follow convolutional layers in CNN architectures. Their primary function is to reduce the size of the feature maps produced by the preceding convolutional layer, thereby decreasing the number of parameters and reducing the overall complexity of the network. The two main types of pooling in CNNs are average pooling and max pooling.
Average pooling outputs the average value of all inputs in a rectangular window of a certain size (Figure 1.16 (a)) and is expressed by the following formula:

\[
P_{avg}(x, y) = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1.20}
\]

Max pooling, on the other hand, outputs the largest value of all inputs within a rectangular window of a certain size (Figure 1.16 (b)) and is expressed as:

\[
P_{max}(x, y) = \max(x_1, x_2, \dots, x_N) \tag{1.21}
\]

Fully connected layers are typically the final layers in a CNN. Their purpose is to map the features learned by the preceding layers – mainly convolutional layers followed by pooling layers – onto the output of the network, making it possible for CNNs to perform image classification, object detection, semantic segmentation, and other image understanding tasks.

Figure 1.16: Visual examples of pooling operations: (a) average pooling; (b) max pooling. Reproduced from [104].

In the remainder of this section, I will describe the specific CNN models used in the research presented in this thesis.

Image classification models. For the image classification tasks reported in this thesis, I used three CNN models: MobileNetV2 [105], MobileNetV3Large [106], and EfficientNet-B7 [107]. MobileNets is a family of compact CNNs designed to balance accuracy and latency. To date, there have been three generations of MobileNets: the original MobileNet [108], MobileNetV2 [105], and MobileNetV3 [106]. As the name suggests, they were developed with the goal of making it possible to deploy them on mobile and other low-power devices, such as edge and embedded devices. The first generation, the original MobileNet, utilises depthwise separable convolutions, originally introduced by Sifre [109] and later implemented in Inception CNN models [110].
While both Inception models and MobileNet use depthwise separable convolutions for the same purpose – to reduce the computational load – there is a substantial difference in how extensively this type of convolution is applied in the two architectures. Namely, while depthwise separable convolutions are only used in the few initial layers of Inception CNNs, MobileNet is primarily built from layers implementing them. The core concept of depthwise separable convolutions is to split the two main components of standard convolution operations – convolving the input to the layer with kernels, and combining the results of the convolutions to obtain new representations – into two distinct steps. To do that, two layers are used: a depthwise convolutional layer and a pointwise convolutional layer, thereby reducing the computational cost. To elaborate, the computational cost of a standard convolution is:

\[
C_{std} = D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \text{ [108],} \tag{1.22}
\]

where \(D_K \times D_K\) is the size of the kernel, \(D_F \times D_F\) is the size of the input to the convolutional layer, \(M\) is the number of channels in the input, and \(N\) is the number of output channels. The cost of depthwise separable convolution is the sum of the costs of depthwise convolution and 1 × 1 pointwise convolution:

\[
C_{dw} = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \text{ [108]} \tag{1.23}
\]

Since the steps of convolving the input and combining the results of the convolutions are performed separately rather than jointly, a reduction in computation is achieved:

\[
\frac{C_{dw}}{C_{std}} = \frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2} \text{ [108]} \tag{1.24}
\]

To sum up, for a sufficiently large number of output channels \(N\), depthwise separable convolutions reduce computations in comparison to standard convolutions by a factor of \(\approx D_K^2\), where \(D_K \times D_K\) is the size of the kernel [105]. Thus, in the case of MobileNet, which uses 3 × 3 depthwise separable convolutions, the computational cost is reduced by ≈ 9 times.
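Equations 1.22–1.24 are easy to check numerically; the layer dimensions below are arbitrary examples of my own:

```python
def cost_standard(dk, m, n, df):
    """Eq. 1.22: multiply-adds of a standard convolution."""
    return dk * dk * m * n * df * df

def cost_separable(dk, m, n, df):
    """Eq. 1.23: depthwise convolution plus 1x1 pointwise convolution."""
    return dk * dk * m * df * df + m * n * df * df

# A 3x3 kernel, 32 input channels, 64 output channels, 112x112 feature map:
dk, m, n, df = 3, 32, 64, 112
ratio = cost_separable(dk, m, n, df) / cost_standard(dk, m, n, df)
print(round(ratio, 4))     # 1/N + 1/DK^2 = 1/64 + 1/9 ≈ 0.1267
print(round(1 / ratio, 1)) # roughly 8x fewer computations, approaching DK^2 = 9
```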
Another means of improving performance in MobileNet is the use of the ReLU6 activation function instead of the standard ReLU; that is, the output is confined to a maximum of 6, ensuring it remains within the range [0, 6]:

\[ f(x) = \min(\max(0, x), 6) \quad (1.25) \]

The purpose of using ReLU6 rather than ReLU is to improve the robustness of low-precision computations.

As a result of these improvements, MobileNet, whose architecture consists of an initial 3 × 3 standard convolution layer with stride 2, followed by 13 depthwise separable convolution blocks (see Figure 1.17 (a)), and then a global average pooling layer and a fully connected layer, contains far fewer parameters than a model with the same architecture using full convolutions – 4.2 million vs 29.3 million – yet still achieves comparable accuracy, 70.6% vs 71.7%, on the ImageNet classification task [108].

The next generation of MobileNets, MobileNetV2, features substantial modifications to the main building blocks of the network. While the original MobileNet has 2 layers in each building block, in MobileNetV2, each building block consists of three layers (Figure 1.17, (b) and (c)):

• The first layer is a 1 × 1 convolution layer; it increases the channel dimension of the input feature map. This expansion is necessary to improve the capacity of the depthwise convolution, as otherwise its capacity would be lower than that of a regular convolution.

• The second layer is a 3 × 3 depthwise convolution layer.

• The third layer is another 1 × 1 convolution layer, which shrinks the expanded feature map back to its original dimension. Unlike in the original MobileNet, this layer in MobileNetV2 does not use an activation function, as the authors found that using linearity rather than nonlinearity in this layer prevents excessive loss of information.
As the overall narrow-wide-narrow sequence is known as an inverted bottleneck, this three-layer building block in MobileNetV2 is accordingly referred to as the mobile inverted bottleneck block [111].

Figure 1.17: Building blocks of MobileNetV1 (a) and MobileNetV2: (b) with stride = 1; (c) with stride = 2. Note the skip connection in (b). Reproduced from [108].

The stride of the depthwise convolutions in the mobile inverted bottleneck block is either 1 or 2. In the case of stride 1, an additional feature – a skip connection

\[ \text{output}_{block} = f(\text{input}_{block}) + \text{input}_{block} \quad (1.26) \]

– is added (see Figure 1.17 (b)) to improve gradient flow across multiple layers [105]. With these improvements, MobileNetV2 achieves better accuracy on ImageNet than its predecessor, MobileNetV1 – 72.0% vs 70.6% – while using fewer parameters, 3.4 million vs 4.2 million.

The architecture of MobileNetV3, available in two sizes – Small10 and Large – was designed by Howard et al. [106] through a two-stage search. The first stage of the search was platform-aware neural architecture search (NAS) following the methodology outlined by Tan et al. [112]. Since both studies used the same RNN-based optimizer and architecture search space, Howard et al. [106] obtained results similar to those of Tan et al. [112] for their Large model, for which the target inference latency was set to ≈ 80 ms. Therefore, the Large architecture selected after the first search stage was one of the architectures designed by Tan et al. [112], MnasNet-A1. MnasNet models build upon the MobileNetV2 architecture by incorporating squeeze-and-excitation operations [113] into the mobile inverted bottleneck blocks. The squeeze operation is applied to the feature map X of a convolutional layer with dimensions H × W × C (where H is the height, W is the width, and C is the number of channels) to obtain a channel descriptor – a vector z with dimensions 1 × 1 × C.
Each entry \(z_c\) of z is obtained by shrinking the respective channel \(X_c\), with dimensions H × W, into a single value using global average pooling (GAP):

\[ z_c = \text{GAP}(X_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j) \quad (1.27) \]

Next, the excitation operation uses z to recalibrate the channels of X. To do this, z is first passed through two fully connected (FC) layers, \(W_1\) and \(W_2\), to obtain the scaling vector s:

\[ s = \sigma(W_2\, \text{ReLU}(W_1 z)), \quad (1.28) \]

where σ denotes the sigmoid function. The vector s is then used to rescale X by multiplying each channel \(X_c\) by a scalar \(s_c\):

\[ \tilde{X}_c = s_c \times X_c \quad (1.29) \]

The goal of the entire \(X \rightarrow \tilde{X}\) transformation, using squeeze and excitation, is to improve the representational power of the CNN [113], thereby improving its performance without adding significant computational overhead.

10 As I did not use MobileNetV3Small in the research reported in this thesis, I will not discuss it in detail here.

The second stage of the architecture search employed the NetAdapt algorithm [114], which was tuned to meet the needs of the study. The modified NetAdapt algorithm refined the architecture obtained in the first stage by iteratively generating a set of new proposals, each of which modified the architecture obtained in the previous iteration in a way that reduced the inference latency by at least δ = 0.01|L|, where L was the latency of the seed model. After generating a proposal, the pretrained model from the previous step was adjusted to the proposed architecture, fine-tuned for 10 000 steps, and evaluated on the target metric, which was to maximise the ratio ∆Acc/∆latency. Here, ∆Acc is the change in the accuracy of the model, and ∆latency is the change in its latency (where ∆latency ≥ δ). Once the target inference latency of 80 ms was achieved, the weights of the final model were obtained by training the resulting architecture from scratch on ImageNet [19]. In addition to the two-stage architecture search, Howard et al.
[106] improved performance by redesigning expensive layers, e.g., the last few layers of the network and the initial convolutional layers. They also replaced ReLU with the swish activation function in some blocks of MobileNetV3. Swish, introduced by Ramachandran et al. [115], is given by the formula swish(x) = x · σ(x). Swish has been shown to improve accuracy over ReLU-based layers. However, it can incur computational overhead when used on mobile devices due to the cost of computing the sigmoid function, which is resource-intensive for mobile CPUs [106]. Therefore, Howard et al. [106] replaced swish with hard-swish (h-swish), its piece-wise linear analogue:

\[ \text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6} \]

Furthermore, as Howard et al. [106] experimentally discovered that most of the benefits of swish come when this activation function is used in the deeper layers of the network, MobileNetV3Large applies h-swish in the initial convolutional layer and in the deeper blocks of the network, while the earlier bottleneck blocks retain a ReLU-type activation, as in MobileNetV2. As a result of these improvements, MobileNetV3Large achieves 75.2% Top-1 accuracy on ImageNet, compared to 72% for MobileNetV2, while also reducing latency by 20% compared to the latter.

EfficientNet-B7 is the largest model in the EfficientNet family of CNNs, developed by Tan and Le [107] with the goal of achieving the same accuracy as state-of-the-art classifiers while having fewer parameters (hence the ‘efficient’ in ‘EfficientNet’). Similar to MobileNetV3, the baseline model, EfficientNet-B0, was developed through the automated neural architecture search proposed by Tan et al. [112], optimising the model for both accuracy and computational efficiency.
The goal of the optimisation was set as:

\[ \underset{m}{\text{maximize}} \;\; \text{Accuracy}(m) \times \left[ \frac{\text{FLOPs}(m)}{T} \right]^{w} \quad \text{[112, 107]} \quad (1.30) \]

where m is the model, Accuracy(m) is the accuracy of the model, FLOPs(m) is the number of floating-point operations (FLOPs), which measures the computational demands of the model, T is the target number of FLOPs (set to 400 million by the authors), and w is a hyperparameter controlling the trade-off between accuracy and FLOPs, set to −0.07 by the authors.

As with MobileNetV2, the main building blocks of EfficientNet-B0 are mobile inverted bottlenecks, with added squeeze-and-excitation optimisation [113]. Overall, the EfficientNet-B0 baseline model consists of an initial standard convolutional layer with 3 × 3 kernels, 16 mobile inverted bottleneck layers with kernel sizes of either 3 × 3 or 5 × 5, and the final layers – a convolutional layer with a 1 × 1 kernel, a pooling layer, and a fully connected layer.

To design the rest of the models in the EfficientNet family, a compound scaling method was used. As Tan and Le observe [107], the typical approaches to scaling CNNs to achieve better accuracy are to increase their depth, width, or, less commonly, image resolution. Usually, only one of these dimensions is scaled at a time, as scaling two or three of them arbitrarily would require many trial-and-error experiments, making the process computationally expensive. Instead, Tan and Le [107] propose a more principled approach: to uniformly scale up the depth, width, and image resolution using a set of fixed scaling coefficients. To do that, one needs to specify the compound coefficient ϕ, which indicates how many more computational resources are available for the scaled-up model compared to the baseline model:

\[ \phi = \log_2\left( \frac{\text{FLOPs}_{scaled}}{\text{FLOPs}_{baseline}} \right), \quad (1.31) \]

where \(\text{FLOPs}_{scaled}\) is the number of FLOPs allocated to the scaled-up model, and \(\text{FLOPs}_{baseline}\) is the number of FLOPs of the baseline model.
Next, the coefficients for the depth d, width w, and image resolution r of the scaled-up model are set as:

\[ \text{depth: } d = \alpha^{\phi} \qquad \text{width: } w = \beta^{\phi} \qquad \text{resolution: } r = \gamma^{\phi} \]
\[ \text{s.t. } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \quad \alpha, \beta, \gamma \geq 1 \quad \text{[107]} \quad (1.32) \]

where α, β, and γ are constants determined via a small-scale grid search optimising for accuracy:

\[ \max_{d, w, r} \; \text{Accuracy}(\mathcal{N}(d, w, r)) \]
\[ \text{s.t. } \mathcal{N}(d, w, r) = \bigodot_{i=1,\ldots,s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\left( X_{\langle r \cdot \hat{H}_i,\, r \cdot \hat{W}_i,\, w \cdot \hat{C}_i \rangle} \right) \]
\[ \text{Memory}(\mathcal{N}) \leq \text{target memory} \]
\[ \text{FLOPs}(\mathcal{N}) \leq \text{target flops} \quad \text{[107]} \quad (1.33) \]

Here d, w, and r, as in Equation 1.32, are the coefficients for network depth, network width, and image resolution, respectively; \(\hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\) denotes layer \(\hat{\mathcal{F}}_i\) repeated \(d \cdot \hat{L}_i\) times in the i-th block of the network; and \(\langle \hat{H}_i, \hat{W}_i, \hat{C}_i \rangle\) is the shape of the input tensor X of block i. As a result, after setting ϕ to 1 for the baseline model EfficientNet-B0 (which implied a search for EfficientNet-B1 with \(\text{FLOPs}_{scaled}/\text{FLOPs}_{baseline} = 2\)), the grid search returned the following optimal coefficients for EfficientNet-B0: α = 1.2, β = 1.1, and γ = 1.15. These values of α, β, and γ were then fixed as constants, and by increasing the value of ϕ, EfficientNet-B1 to B7 were derived. The largest model, EfficientNet-B7, achieved a marginally better accuracy of 84.4% on the ImageNet dataset compared to 84.3% for its closest competitor at the time, the GPipe model [116], while having considerably fewer parameters – 66 million versus 557 million.

Semantic segmentation models. For semantic segmentation tasks, I use two image classification models – Xception [117] and MobileNetV2 [105] – both extended with a DeepLabv3 [118] segmentation head.

Xception, released by Chollet in 2017, is a modification of the Inception type of CNN architecture, originally introduced by Szegedy et al. [119] and further developed as Inception-v2 [110], Inception-v3 [120], Inception-v4 [121], and Inception-ResNet [121]. The main building blocks of Inception models are the eponymous modules.
While the implementation details of these modules vary from one generation of Inception networks to another, an Inception module, in its simplified form, consists of a set of 1 × 1 kernels followed by 3 × 3 kernels, with the resulting feature maps concatenated (Figure 1.18).

Figure 1.18: Architecture of a simplified Inception module. Reproduced from [117].

The function of the 1 × 1 kernels is to find cross-channel correlations, while the function of the 3 × 3 kernels (or larger ones, e.g., 5 × 5 kernels, in other versions of Inception modules) is to find spatial correlations. Since these operations are decoupled in Inception modules, the underlying hypothesis of Inception, as Chollet [117] observes, is that these operations are sufficiently independent from each other that it is better to keep them separate rather than combine them, as a regular convolutional layer would. In Xception, this hypothesis is taken further (hence ‘Xception’, which stands for ‘Extreme Inception’) by postulating that these operations can be entirely decoupled, replacing Inception modules with depthwise separable convolutions.

The resulting Xception architecture consists of 36 depthwise separable convolutional layers with residual connections. These layers form the three major parts of the model: the entry flow, through which the input data passes first, the middle flow, repeated eight times, and the exit flow (Figure 1.19). Xception has the same number of parameters as its predecessor, Inception-v3 [120]; it demonstrated marginally better accuracy than Inception-v3 on the ImageNet dataset, 79% vs 78.2%, and substantially better accuracy than Inception-v3 on JFT, an internal Google dataset [117].

Figure 1.19: Architecture of Xception. Reproduced from [117].

For semantic segmentation, Xception is used as the backbone of the DeepLabv3 [118] model. The models in the DeepLab family are designed for repurposing image classifiers for semantic segmentation tasks.
The first DeepLab model was introduced by Chen et al. [122] in 2014 and was soon followed by DeepLabv2 [123], v3 [118], and v3+ [124] (as DeepLabv3+ was not used in the research reported in this thesis, I do not discuss it here).

The backbone of DeepLabv1 is VGG-16 [125] – a 16-layer CNN image classifier named after the Visual Geometry Group at the University of Oxford, where the model was developed – which was a state-of-the-art CNN at the time DeepLabv1 was released. To adapt VGG-16 for semantic segmentation, the last max pooling layers are removed, and its fully connected layers are replaced by convolutional layers to preserve dense spatial information that would otherwise be lost, as fully connected layers would collapse it into a single vector. However, these changes alone are not sufficient, as standard convolutions would yield very sparse feature maps, a common issue when repurposing CNN-based image classifiers for semantic segmentation. One way to address this is to use deconvolutional layers to produce more refined segmentation masks [126]. However, as Chen et al. point out [122], this substantially increases the complexity and therefore the training time of the model. Therefore, DeepLabv1 employs an alternative approach: atrous (from the French à trous – ‘with holes’) convolutions. Atrous convolutions, also called the hole algorithm and dilated convolutions, were originally introduced in the field of signal processing to efficiently compute the undecimated wavelet transform [127]. In the context of computer vision, this method involves either inserting zeros into convolutional kernels to increase their size or, more efficiently, keeping the kernels unchanged but sparsely sampling the feature map they convolve (Figure 1.20).
Atrous convolutions are given by the formula:

\[ y[i] = \sum_{k} x[i + r \cdot k]\, w[k] \quad \text{[118]} \quad (1.34) \]

where i is a location on the output y and the feature map x, w is the kernel, and r is the atrous rate, that is, the stride with which the input signal is sampled. In VGG-16, as the backbone of DeepLabv1, the last three convolutional layers are expanded by a factor of 2 (i.e., r = 2), and the subsequent convolutional layer (which was a fully connected layer in the original model) is expanded by a factor of 4 (i.e., r = 4).

Figure 1.20: Illustration of atrous convolutions: (a) different rates of atrous convolutions for a kernel size of 3 × 3; (b) atrous convolution in 2D. Top row of (b): using standard convolution on a feature map with low resolution leads to the extraction of sparse features. Bottom row of (b): using atrous convolution (rate = 2) on a feature map with high resolution leads to the extraction of dense features. Adapted from [123] and [118].

Another hallmark of DeepLabv1 is the use of a fully connected CRF [70] to address the inherent problem that arises when CNN-based image classifiers are repurposed for semantic segmentation, namely, that classifiers need to be invariant to spatial transformations, whereas semantic segmentation models require spatial accuracy [122]. The CRF is an efficient solution to this problem, as it can capture edge details while accounting for long-range dependencies. Furthermore, at the time of the release of DeepLabv1, the CRF was more computationally efficient than its alternatives [122]. By integrating these features with VGG-16, DeepLabv1 achieved both better speed and accuracy, improving by 7.2% in mIoU over the nearest competitor on the PASCAL VOC dataset, while keeping the overall architecture of the system simple, as it essentially consists of two modules: a CNN and a CRF module [122].
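The effect of the atrous rate r in Eq. 1.34 is easiest to see in one dimension; the sketch below (function name and data are mine, for illustration only) computes the convolution at all valid output positions:

```python
def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution, a toy sketch of Eq. 1.34:
    y[i] = sum_k x[i + rate * k] * w[k], valid positions only."""
    span = rate * (len(w) - 1)
    return [sum(x[i + rate * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span)]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 1, 1]
print(atrous_conv1d(x, w, rate=1))  # standard convolution: [6, 9, 12, 15, 18]
print(atrous_conv1d(x, w, rate=2))  # samples every 2nd input: [9, 12, 15]
```

With rate = 2, the same 3-tap kernel covers a span of 5 input positions, illustrating how atrous convolution enlarges the field of view without adding kernel weights.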
DeepLabv2 further improves the accuracy of semantic segmentation by efficiently addressing the issue that objects can appear at different scales in input images – e.g., a tree might take up only a part of an image, or its entirety. To handle this, a model needs filters with fields of view of different sizes to segment objects efficiently. As the authors of DeepLabv2 observe [123], the standard approach prior to their work was to run a model on several rescaled versions of the same image and then aggregate the resulting feature maps. However, as they note, while this indeed improved segmentation accuracy, it also introduced significant computational overhead. To solve this, DeepLabv2 employs multiple atrous convolutional layers with different sampling rates in parallel. The outputs from these layers are processed separately and then eventually fused to obtain the final result. Chen et al. [123] call this approach ‘atrous spatial pyramid pooling’ (ASPP; Figure 1.21) and use it not only with VGG-16, as in the first version of DeepLab, but also with the more efficient ResNet-101 [76], demonstrating the compatibility of DeepLabv2 with various architectures. As Chen et al. [123] report, these improvements enable DeepLabv2 to achieve a 79.7% mIoU on the PASCAL VOC dataset and deliver state-of-the-art results on three other datasets: PASCAL-Context [128], PASCAL-Person-Part [129], and Cityscapes [130].

Figure 1.21: Atrous spatial pyramid pooling (ASPP): (a) schematic representation of ASPP, where the different colours of the rectangles correspond to the different fields of view of kernels with different rates; (b) actual architecture of the ASPP module in DeepLabv2. Adapted from [123].

The major change in DeepLabv3, compared to previous versions, is that it no longer employs the CRF, allowing the model to be trained end-to-end13 and increasing its computational efficiency, as the CRF had introduced substantial computational overhead [118].
To replace the CRF, DeepLabv3 implements two key changes. First, batch normalisation [110] is introduced, which standardises the inputs to the layers of the model across each mini-batch. Second, the ASPP module is enhanced by incorporating image-level features that encode global context, helping to mitigate the issue of a decreasing number of valid kernel weights in ASPP. This issue arises because, as larger sampling rates are used in ASPP, fewer weights are applied to the valid feature region, with more being applied to the zero padding [118]. In practice, this entails two main changes to the implementation of the ASPP module (Figure 1.22):

• adding a 1 × 1 convolution layer;

• applying global average pooling to the feature map produced by the layer preceding the ASPP module, passing the obtained image-level features through a 1 × 1 convolution with 256 kernels, and then bilinearly upsampling the output back [118].

13 Previously, with the CRF, the training consisted of two stages.

As a result of the above improvements, DeepLabv3 achieved an mIoU of 85.7% on the PASCAL VOC 2012 test set while also improving computational efficiency by eliminating the CRF.

Figure 1.22: Overview of the DeepLabv3 architecture. Reproduced from [131].

While the original DeepLabv3 paper reports only experiments with ResNet-50 and ResNet-101 [76] CNNs, the proposed system is compatible with a broad range of CNN-based classifiers. In particular, the DeepLabv3 repository14 includes models based on Xception [117] and MobileNetV2 [105]. To repurpose the original Xception classifier for semantic segmentation, the authors of the repository implemented the following changes:

• increased the number of layers, making the model available in three versions – with 41, 65, and 71 layers, respectively;

• replaced max pooling operations with atrous separable convolutions;

• added batch normalisation and ReLU activation after each 3 × 3 depthwise convolution.
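Batch normalisation, which appears both in DeepLabv3 and in the adapted Xception backbone above, can be sketched in one dimension as follows (a toy version of my own; the learned scale and shift parameters of the real layer are omitted):

```python
import math

def batch_norm_1d(batch, eps=1e-5):
    # Standardise a mini-batch of scalar feature values to zero mean
    # and (approximately) unit variance, as in batch normalisation [110].
    # The learned scale (gamma) and shift (beta) parameters are omitted.
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

normed = batch_norm_1d([1.0, 2.0, 3.0, 4.0])
print(normed)  # values symmetric around zero
```

In a real convolutional layer the same statistics are computed per channel over the whole mini-batch, and the normalised values are then rescaled and shifted by learned parameters.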
14 https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md; accessed 24 September 2024.

In the adaptation of MobileNetV2 for semantic segmentation, the ASPP module is omitted to enable faster computation.

Object detection models. For the research involving object detection tasks reported in this thesis, I use the YOLOv5 model. Models from the YOLO (You Only Look Once) family, as the name suggests, can detect objects after just a single forward pass of the input image. This makes such single-stage object detectors particularly fast and suitable for real-time applications, especially in comparison with two-stage object detectors such as R-CNN [73] and its successors, Fast R-CNN [132], Faster R-CNN [133], and Mask R-CNN [134]. While single-stage object detectors initially tended to be less accurate than their two-stage counterparts, many versions of YOLO models have been introduced over time, from the original YOLOv1, introduced in 2015 by Redmon and Farhadi [47], to the most recent YOLOv8 [135] and YOLO-NAS (where NAS stands for ‘neural architecture search’) [136], and as each generation has brought improvements over previous versions, the accuracy of YOLO models has improved substantially. In the following, I provide a brief overview of YOLO models from the foundational v1 to the v5 model used in the research reported in this thesis.

YOLOv1 [47] operates by dividing an input image into an S × S grid of cells (by default, S = 7; Figure 1.23), with the notion that when the centre of an object in the image is located within one of these cells, that cell is responsible for detecting the object, whereas the other cells are supposed to disregard it to prevent multiple detections. The output of each grid cell includes:

• B bounding boxes (by default, B = 2). For each bounding box, 5 predictions are given: x, y, w, h, and a confidence score.
(x, y) are the coordinates of the centre of the bounding box, whereas w and h are its width and height; importantly, the former pair of values is given relative to the boundaries of the grid cell, whereas the latter pair is given relative to the entire image. The confidence score \(S_{conf}\) is defined as:

\[ S_{conf} = p(\text{object}) \times \text{IoU}_{pred}^{truth} \quad \text{[137]} \quad (1.35) \]

where p(object) is the probability that an object is present in a given bounding box, and \(\text{IoU}_{pred}^{truth}\) is the IoU between the predicted bounding box and the ground-truth bounding box.

• C class probabilities, conditioned as Pr(Class_i | Object) on the grid cell containing an object. Regardless of the value of B, only one set of C values is predicted per grid cell.

Architecture-wise, YOLOv1 consists of 24 convolutional layers (with kernel sizes of either 1 × 1, to reduce the feature space, or 3 × 3) followed by two fully connected layers. Apart from the final layer, which uses a linear activation function, all other layers use a leaky ReLU [138] activation:

\[ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0.1x & \text{otherwise} \end{cases} \quad (1.36) \]

Overall, YOLOv1 was the fastest object detection model at the time of its release, with the capacity to detect objects in real time. However, while it achieved an mAP of 63.4% on the PASCAL VOC2007 dataset, its localisation error was substantially higher than that of the slower but more precise two-stage state-of-the-art detectors of that time, such as Fast R-CNN [132]. Furthermore, YOLOv1 suffered from low recall [139].

Figure 1.23: General scheme of YOLO. Reproduced from [47].

YOLOv2 [139] features a number of improvements over v1, including batch normalisation for all convolutional layers, a higher input resolution of the backbone classifier (448 × 448 pixels in v2 vs 224 × 224 in v1), and the adoption of a fully convolutional architecture through the removal of dense layers. Yet another substantial change is the introduction of anchor boxes – boxes of predefined size used for predicting bounding boxes.
Multiple anchor boxes are defined for each grid cell, allowing the varying sizes and shapes of objects to be captured more precisely. To use anchor boxes efficiently, their sizes are found by k-means clustering on the bounding boxes from the training split of the PASCAL VOC2007 dataset, rather than being hand-picked.

The architecture of YOLOv2 is based on the custom backbone architecture Darknet-19, which contains 19 convolutional layers and 5 max-pooling layers. The detection head, which is added to the backbone, consists of four convolutional layers and a passthrough layer. The passthrough layer brings features with a finer resolution of 26 × 26 from an earlier layer and concatenates them with the coarser 13 × 13 grid on which detections are predicted, thus improving the detector's access to fine-grained features of the input image. Due to the above improvements, YOLOv2 achieved an mAP of 78.6% on the PASCAL VOC2007 dataset, a substantial improvement over YOLOv1.

To reduce localisation errors and improve the detection of small objects, YOLOv3 [140] transitioned to the new generation of the backbone, Darknet-53. In addition to the substantially increased depth of the model, which consists of 53 convolutional layers vs 19 layers in the YOLOv2 backbone, Darknet-53 features the adoption of strided convolutions instead of max-pooling layers and the addition of residual connections. Another hallmark of YOLOv3 is multi-scale prediction: instead of predicting detections on a single grid, as in YOLOv1 and YOLOv2, there are three output branches of the backbone with increasing resolution – 13 × 13, 26 × 26, and 52 × 52 grids – which improves the detection of small objects.
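Since IoU underlies both the YOLO confidence score (Eq. 1.35) and the anchor-box clustering described above, a minimal reference implementation may be useful (the box format and names are my own, chosen for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max); a minimal sketch."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (0 if disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, partial overlap
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0, perfect overlap
```

A perfect prediction yields an IoU of 1, disjoint boxes yield 0, and everything in between quantifies localisation quality.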
Due to the advancements in object detection around the time YOLOv3 was released, this and the following versions of YOLO have been evaluated on the Microsoft Common Objects in Context (MS COCO) dataset [141], a more challenging benchmark for object detection than the previously used PASCAL VOC 2007, which had become too easy a challenge. YOLOv3-SPP, a version of YOLOv3 that uses a spatial pyramid pooling (SPP) block with different kernel sizes k × k (where k ∈ {1, 5, 9, 13}) and thus has a larger receptive field, achieved an mAP of 36.2% on MS COCO. This was a state-of-the-art result at the time, while YOLOv3-SPP also demonstrated substantially faster inference than its competitors – RetinaNet [142] and SSD (Single Shot Detector) variants [143, 144].

YOLOv4 and subsequent versions were released by research groups other than the original team led by Farhadi and Redmon, as Redmon stepped away from computer vision research due to concerns about its potential use for military purposes and privacy invasion. YOLOv4, released by Bochkovskiy et al. [145], features a number of improvements over previous versions, based on extensive experiments conducted by the authors. To identify an optimal backbone, they explored a range of architectures, including ResNeXt50 [146], the already mentioned Darknet-53, and EfficientNet-B3 [107]. Based on these experiments, as well as on the theoretical reasoning that the most suitable backbone candidate would have a larger receptive field and a larger number of parameters [145], they selected the Darknet-53 architecture, enhanced with cross-stage partial connections (CSP; [147]), which they named CSPDarknet53.
Similar to YOLOv3-SPP, YOLOv4 adds an SPP block to the backbone and uses multi-scale prediction; however, the SPP of YOLOv4 is further modified, and while in YOLOv3-SPP the features from different layers of the backbone are aggregated for multi-scale prediction using the Feature Pyramid Network (FPN) method [148], YOLOv4 uses the Path Aggregation Network (PAN) method [149] instead. The object detection head in YOLOv4 remains anchor-based, as in YOLOv3.

In addition to architectural changes, YOLOv4 introduces several further improvements, divided into two categories: Bag-of-Freebies and Bag-of-Specials. The Bag-of-Freebies includes methods that improve the accuracy of inference without slowing it down; cost-wise, they either change the strategy of training the model or increase the cost of training – but, to emphasise again, not the cost of inference. Such methods include data augmentation techniques such as CutMix [150] and Mosaic data augmentation, as well as other techniques such as Complete IoU (CIoU) loss [151], Cross mini-Batch Normalization (CmBN), and self-adversarial training. The methods in the Bag-of-Specials are those that slightly decrease the speed of inference but substantially improve the accuracy of object detection. Among the methods of this kind implemented in YOLOv4 are the Mish activation function [152], multi-input weighted residual connections (MiWRC) in the backbone, and Distance IoU Non-Maximum Suppression (DIoU-NMS; [151]) in the object detection head. As a result of these improvements, YOLOv4 achieved an mAP of 43.5% on the MS COCO dataset while maintaining real-time inference speed.

YOLOv5 [153], released shortly after YOLOv4 by Ultralytics under the leadership of Glenn Jocher, marks a shift from Darknet to PyTorch [79] as its primary framework.
The adoption of PyTorch, with its larger and more established ecosystem and infrastructure, has facilitated the wider use of YOLO (especially, as Hussain [137] notes, for mobile applications) and has increased the number of open-source contributors to it. YOLOv5 offers four models of different sizes, from the small YOLOv5s with 7.5 million parameters to the extra-large YOLOv5x with 86.7 million parameters. Other notable changes in this version include the use of the Sigmoid Linear Unit (SiLU; [154]) activation function in the convolutional layers, the adoption of advanced data augmentation techniques (including those from the open-source Python library Albumentations [155]), and the introduction of the AutoAnchor algorithm. AutoAnchor fine-tunes anchor boxes on the training dataset prior to training the network itself, making them more suitable for a particular dataset, in contrast to the previous approach of using anchor boxes tuned on the PASCAL VOC2007 dataset for all tasks.

Architecture-wise (see Figure 1.24 for an overview), YOLOv5 employs a modification of the CSPDarknet53 backbone called New CSP-Darknet53, which connects to the object detection head via the Spatial Pyramid Pooling Fusion (SPPF) module – a more computationally efficient equivalent of the SPP module – and the New CSP-PAN module, an updated version of the PAN used in YOLOv4. The object detection head itself remains largely the same as in YOLOv3 and YOLOv4.

Thanks to the above improvements, the largest version of YOLOv5, YOLOv5x, achieved an mAP of 50.7% on the MS COCO dataset with an image input size of 640 × 640 pixels. Moreover, with an increased input size of 1536 × 1536 pixels and the use of test-time augmentation, it achieved an mAP of 55.8% on MS COCO.

Figure 1.24: Schematic representation of YOLOv5. Reproduced from [156].
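The anchor selection idea used in YOLOv2 and refined by YOLOv5's AutoAnchor – clustering training-set box shapes – can be sketched as follows. This is a deliberately naive toy version of my own (the initialisation, data, and function names are hypothetical, not the actual implementation), using 1 − IoU as the clustering distance, as in YOLOv2:

```python
def wh_iou(a, b):
    """IoU of two boxes given only as (width, height), assuming shared
    centres, as in IoU-based anchor clustering."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def anchor_kmeans(boxes, k, iters=10):
    """Toy k-means over box shapes with 1 - IoU as the distance;
    a sketch only, with naive initialisation from the first k boxes."""
    centroids = boxes[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Assign each box to the centroid with the highest IoU.
            j = max(range(k), key=lambda c: wh_iou(b, centroids[c]))
            clusters[j].append(b)
        # Move each centroid to the mean shape of its cluster.
        centroids = [
            (sum(b[0] for b in cl) / len(cl), sum(b[1] for b in cl) / len(cl))
            if cl else centroids[j]
            for j, cl in enumerate(clusters)]
    return centroids

# Hypothetical box shapes: small near-squares and wide rectangles.
boxes = [(10, 10), (12, 11), (9, 12), (50, 20), (55, 22), (48, 18)]
print(anchor_kmeans(boxes, k=2))  # one small and one wide anchor prior
```

The returned centroids serve as anchor priors: one roughly matching the small square boxes and one matching the wide rectangles.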
1.4 Datasets for image understanding tasks

Having explored image understanding models, I now turn to the discussion of the datasets for training and evaluating them. As the reader will observe, the focus of the discussion broadens, as many insights pertain not only to image understanding but also to deep learning as a whole.

1.4.1 The fundamental role of data

Much of the attention of the general public and of ML researchers and practitioners alike to the progress in the field of AI has been captured by the emergence and performance of new AI models, be they chatbots powered by large language models, such as ChatGPT [3], text-to-image generative models, such as Midjourney [157] and DALL·E [158], or the image understanding models discussed in earlier sections. While this is hardly surprising, as these models are the entities that actually carry out AI tasks, such a strong focus on the models may overshadow another cornerstone of AI: datasets for training and evaluation. Neglecting the role of datasets would be particularly erroneous at the current historical stage of AI development, since many AI subfields are presently dominated by deep learning approaches, which are notoriously data-hungry. Therefore, while AI models are indeed at the forefront of AI research, as Denton et al. observe, datasets ‘form the background conditions upon which ML research and development operates: they structure how ML practitioners frame and approach problems, inform how progress is defined and tracked within research communities, and create the grounds upon which algorithmic techniques are developed, tested, and ultimately deployed in industry contexts’ [159].

As various authors acknowledge (see e.g. [14, 85]), the availability of large datasets for training, evaluating, and benchmarking models has been a major enabling factor for the advancement of deep learning. Without such datasets, it would not be possible to fully leverage the capabilities of DNNs.
Open access to these datasets has sparked competition to develop better models, democratising deep learning research and enabling researchers to work on image understanding tasks even without the resources to acquire and label data independently. However, while open datasets such as CIFAR-10 and CIFAR-100 [160], MNIST [40], ImageNet [19], PASCAL VOC [57], MS COCO [141], and Cityscapes [130] have accelerated the development and improvement of image understanding models, real-world applications often require specialised datasets. This leads to several data-related challenges when applying DNNs to real-world problems, such as: • datasets must be collected, curated and labelled, with the difficulty of labelling increasing from image classification to object detection to semantic segmentation; • datasets need to be large enough to prevent overfitting and diverse enough to avoid bias; • data collection may be impeded or made impossible by moral, ethical, and legal concerns; • due to concept drift, that is, the change between the class distribution at the time of training and the current class distribution [161], it may be necessary to acquire new data and retrain the model on them. Characteristic examples of such challenges are biomedical datasets (see Chapter 5): they tend to be comparatively small, making overfitting a significant issue; acquiring additional data is often time-consuming and expensive; labelling may require scientific expertise and a lot of time, as images tend to be large and complex; and class distribution tends to be imbalanced [162], with more negative samples (e.g., images of cancer-free tissue samples) than positive samples. To mitigate the problems of data insufficiency and imbalance, one can use approaches such as transfer learning, fine-tuning, data augmentation, and the use of synthetic data, as discussed in the following.
1.4.2 Transfer learning and fine-tuning Transfer learning aims to improve the performance of a model on a target domain by leveraging knowledge learned from another domain (or several domains), referred to as the source domain [163]. This allows models to learn effectively even when there is a limited supply of training data from the target domain; furthermore, the time and computational costs of training a model can also be substantially reduced [164]. As surveys on transfer learning observe [165, 163], this type of knowledge transfer is quite similar to our everyday experience: for instance, if one has already learned to play some musical instrument, it will likely be easier to learn a new instrument, as the skills needed for both tasks are similar, albeit not the same. In the context of deep learning, the typical workflow of implementing transfer learning is as follows [166]: • retrieve layers from a model trained on the source domain; • freeze them to prevent information loss during the next steps; • add several new trainable layers on top of the frozen layers; • train the new layers on the target dataset; • optionally: unfreeze the entire model (or part of it) and retrain it on the target dataset using a low learning rate. The last of these steps, called fine-tuning, allows the knowledge in the unfrozen layers to be gradually adapted to the target dataset; due to the use of a low learning rate, it is expected that this adaptation will not destroy the information already encoded in those layers. Transfer learning is often facilitated by the availability of open-access CNN models trained on large datasets. Due to the advantages it offers, it is used in various DL application domains, including the ones that the following chapters of the thesis are concerned with, namely, gesture recognition [167], autonomous driving [168], robotics [169], and medical image classification [170].
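The freeze-then-train-a-new-head workflow above can be illustrated with a deliberately tiny NumPy sketch, in which a fixed random projection stands in for the frozen pretrained layers and a logistic-regression head is the only trainable part. This is a toy illustration of the idea on invented data, not a real CNN pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" frozen feature extractor: a fixed random projection with a
# ReLU, standing in for convolutional layers learned on a source domain.
W_frozen = rng.normal(size=(4, 8))

def features(x):
    return np.maximum(x @ W_frozen, 0.0)  # frozen: never updated below

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy target-domain data: two well-separated clusters (invented for the sketch).
X = np.vstack([rng.normal(-1.0, 0.3, size=(50, 4)),
               rng.normal(+1.0, 0.3, size=(50, 4))])
y = np.array([0] * 50 + [1] * 50)

# New trainable head (logistic regression) on top of the frozen features.
w = np.zeros(8)
b = 0.0
lr = 0.1
for _ in range(500):
    F = features(X)
    p = sigmoid(F @ w + b)
    grad = p - y                     # gradient of BCE loss w.r.t. the logits
    w -= lr * (F.T @ grad) / len(y)  # only the head parameters change;
    b -= lr * grad.mean()            # W_frozen stays fixed throughout

accuracy = ((sigmoid(features(X) @ w + b) > 0.5) == y).mean()
```

Note that only `w` and `b` are ever updated; unfreezing `W_frozen` and continuing with a low learning rate would correspond to the optional fine-tuning step.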
1.4.3 Data augmentation Another approach to dealing with the problem of data insufficiency is to use data augmentation techniques. The underlying assumption of data augmentation for image understanding tasks is that it is possible to extract more information for training a model from the original data by artificially inflating the training dataset by means of either warping images or oversampling them [171]; importantly, the manipulations of the images should not change the labelling, as that would make the augmented dataset unusable for training the model. Some examples of image warping (see Figure 1.25) are geometric transformations such as image flipping, random rotations, image cropping, image translation, random erasing, and noise injection; furthermore, data warping may include colour transformations such as colour space changes [172, 171]. The simplest approach to oversampling is image replication, that is, duplicating images belonging to the underrepresented classes; some more advanced oversampling methods are feature space augmentations and mixing images [171]. Furthermore, some authors consider approaches such as GANs [12] to belong to the domain of data augmentation; however, it appears more reasonable to follow the distinction suggested by Nikolenko [9] and reserve the term ‘data augmentation’ for methods involving recombination and adaptation of real data, while using the term ‘synthetic data’ for methods involving the creation of new data. In any case, the boundary between these two groups of methods is sometimes quite blurry. Figure 1.25: Examples of applying image augmentation techniques. 1.4.4 Synthetic data The most promising approach to solving the data availability problem for training DNNs is arguably the use of synthetic data, that is, artificially generated data that are similar to the data from the target domain.
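In contrast to synthetic data generation, the warping augmentations described in Section 1.4.3 merely transform existing real images while leaving their labels intact. A minimal NumPy sketch of two such operations, horizontal flipping and random erasing (illustrative only; the function names are mine):

```python
import numpy as np

rng = np.random.default_rng(42)

def horizontal_flip(img):
    """Mirror the image along its width axis; the class label is unchanged."""
    return img[:, ::-1]

def random_erase(img, max_frac=0.3):
    """Zero out a random rectangle (up to max_frac of each dimension),
    discouraging the model from relying on any single image region."""
    h, w = img.shape[:2]
    eh = int(rng.integers(1, max(2, int(h * max_frac))))
    ew = int(rng.integers(1, max(2, int(w * max_frac))))
    y0 = int(rng.integers(0, h - eh + 1))
    x0 = int(rng.integers(0, w - ew + 1))
    out = img.copy()
    out[y0:y0 + eh, x0:x0 + ew] = 0
    return out

img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
flipped = horizontal_flip(img)
erased = random_erase(img)
```

Applying such transformations on the fly during training effectively multiplies the dataset size without changing any label, which is precisely what makes them safe to use.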
Nowadays, synthetic data are used for a broad range of deep learning applications in image understanding tasks, from the navigation systems of autonomous vehicles and unmanned aerial vehicles to medical image analysis to crowd counting (see Nikolenko [9] for a comprehensive overview). The means of generating synthetic data vary widely as well, including GANs, 3D computer graphics software toolsets such as Blender15, 3D gaming engines such as Unity16 and Unreal Engine17, and assets (e.g., screenshots) from videogames. Importantly, in many cases, it is not necessary to label synthetic data manually, as labelling can be obtained automatically when generating a synthetic dataset. That makes synthetic data much more convenient to use and often less expensive to acquire than real-world data, which typically need labelling. Other potential advantages of using synthetic data include reducing reliance on real-world data, addressing the underrepresentation of some classes in real-world datasets, and obtaining data that would be challenging to acquire in the real world due to legal obstacles or privacy concerns. In the context of deep learning, there are several main ways to use synthetic data [9], namely: • to train DNN models such as classifiers or object detectors solely on synthetic data and then run inference on the target domain data, i.e., on the real-world data; • to refine synthetic data (e.g., by means of generative models such as GANs) and then train DNN models, as mentioned in the previous point, on the refined data; 15https://www.blender.org/; accessed 10 September 2024. 16https://unity.com/; accessed 10 September 2024. 17https://www.unrealengine.com/en-US; accessed 10 September 2024. • to augment real-world datasets with synthetic data in order to obtain a better (e.g., more diverse, or larger, or both) training dataset.
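The automatic-labelling advantage mentioned above can be illustrated with a toy generator: because the scene is constructed programmatically, the bounding-box annotation is known exactly at generation time. A schematic NumPy sketch (a toy stand-in, not a substitute for tools such as Blender):

```python
import numpy as np

rng = np.random.default_rng(7)

def synth_sample(size=64, obj=12):
    """Generate a toy 'synthetic image': a bright square object on a noisy
    background. The bounding-box label is a by-product of generation."""
    img = rng.normal(0.2, 0.05, size=(size, size))  # noisy background
    x0 = int(rng.integers(0, size - obj + 1))
    y0 = int(rng.integers(0, size - obj + 1))
    img[y0:y0 + obj, x0:x0 + obj] = 1.0             # the "object"
    bbox = (x0, y0, x0 + obj, y0 + obj)             # label comes for free
    return img, bbox

img, bbox = synth_sample()
```

No human annotator touches the sample: the generator emits the image and its ground-truth bounding box together, which is what makes large synthetic object-detection datasets comparatively cheap to produce.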
The main obstacle for the applications of synthetic data in deep learning is the domain gap between real-world and synthetic data, which may lead to a decrease in the performance of models trained solely on synthetic images. To overcome this issue, one can adhere to the augmentation approach rather than rely on synthetic data alone, try to make synthetic images as photorealistic as possible, or employ simulation-to-reality (Sim2Real) domain transfer techniques. 1.5 Concluding remarks In this chapter, I provided the background for the research on image understanding presented in the rest of this thesis. I began with a broad outline of the field of computer vision, emphasising its connection with various disciplines, and subsequently focused the discussion on the narrower (yet still vast) topic of image understanding. Since the experimental work reported in this thesis was mainly concerned with the three main image understanding tasks – image classification, object detection, and semantic segmentation – I outlined the main metrics for measuring the performance of models on these tasks and then discussed methods for solving them, in particular deep learning-based methods, which nowadays have largely succeeded their so-called classical predecessors. I offered an in-depth overview of the particular models I used for my experiments, namely the MobileNetV2, MobileNetV3Large, and EfficientNet-B7 image classifiers, the DeepLabv3 semantic segmentation models, and the YOLOv5 object detector. My goal was to capture the essential features of these models, such as (to mention just a few examples) the use of depthwise separable convolutions in MobileNet models, the notion of scalability underlying EfficientNet models, the use of atrous convolutions in DeepLab models, and the prediction of both class probabilities and bounding boxes with a single network in YOLO models.
To present these models adequately, it was necessary to place them in their historical context, that is, to compare them either with their predecessors from the same model family (e.g., MobileNetV3 with MobileNetV2, and MobileNetV2 with MobileNetV1, or YOLOv5 with the earlier versions of YOLO) or with related architectures (e.g. Xception – ‘Extreme Inception’ – with Inception). In doing all that, I had to tread a fine line between presenting the information in a rather schematic manner on the one hand and attempting to transform the chapter into something akin to a deep learning primer on the other. As a consequence, tradeoffs were necessary, and while I hope to have presented some of the concepts and operations underlying the models in sufficient detail to allow the reader to understand them without resorting to additional sources, some other features of the models were only briefly mentioned, as elaborating on them would require overly lengthy detours. The same considerations apply to the discussion of another essential aspect of deep learning – data for training and validating models: while I highlighted the crucial role of data in deep learning and the impact they have on shaping the practices and goals of the AI community, and outlined the main means of mitigating the problems of insufficient data and dataset bias, such as transfer learning, data augmentation, and the use of synthetic data, that discussion was unavoidably limited in scope and depth. On the whole, while I would prefer the background chapter of this thesis to be as self-contained as possible, an uneven level of detail was hardly avoidable: quite a few deep learning concepts can be understood even without a background in the field, but many of the current state-of-the-art CNN models, e.g.
the recent versions of the YOLO object detectors, and many advanced approaches to data augmentation and synthetic image generation incorporate a lot of complex solutions and therefore do not lend themselves well to brief yet exhaustive explanations ab initio. Chapter 2 Hand-Washing Movement Classification In this chapter, I explore applications of CNNs to the classification of hand-washing movements, with the goal of designing an automated system for monitoring the quality of hand washing to evaluate and improve compliance with hand hygiene guidelines in a hospital setting. The chapter is based on the following scholarly articles and conference papers: [173] M. Ivanovs, R. Kadikis, M. Lulla, A. Rutkovskis, and A. Elsts, “Automated quality assessment of hand washing using deep learning,” arXiv:2011.11383, 2020; [174] M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Gromova, M. Ivanovs, A. Skadins, R. Kadikis, and A. Elsts, “Hand-washing video dataset annotated according to the world health organization’s hand-washing guidelines,” Data, vol. 6, no. 4, p. 38, 2021; [175] O. Zemlanuhina, M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Melbarde-Kelmere, A. Elsts, M. Ivanov, and O. Sabelnikovs, “Influence of different types of real-time feedback on hand washing quality assessed with neural networks/simulated neural networks,” in SHS Web of Conferences, vol. 131, p. 02008, EDP Sciences, 2022; [176] A. Elsts, M. Ivanovs, R. Kadikis, and O. Sabelnikovs, “CNN for hand washing movement classification: What matters more – the approach or the dataset?,” in 2022 Eleventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–6, IEEE, 2022. My contribution to these studies was as follows. Although I did not personally conduct data collection or labelling, I was involved in planning and overseeing these stages of the dataset design.
Furthermore, I actively participated in processing the datasets and conducted experiments with CNNs; it should be mentioned, however, that in the study by Elsts et al. [176], my senior colleague Atis Elsts led the work. Additionally, I was actively involved in writing both the drafts and the final versions of the manuscripts of all the above studies except the study by Zemlanuhina et al. [175], where my involvement in writing was less direct. 2.1 Introduction To better understand the need for an automated system for monitoring the quality of hand washing, it is essential to consider the role of hand hygiene in healthcare. Poor hand hygiene is a primary cause of two major epidemiological issues: the spread of multidrug-resistant (MDR) bacteria and the spread of infections in a clinical setting [177]. The literature underscores that these issues are of considerable scale: it has been estimated that infections caused by MDR bacteria result in more than 35 000 deaths in the EU/EEA [178] and 1 270 000 deaths worldwide [179] annually. Given the severity of this problem, the European Centre for Disease Prevention and Control (ECDC) and the World Health Organisation (WHO) have designated limiting the spread of antimicrobial resistance as one of the global healthcare priorities. Furthermore, hands are the main pathway of germ transmission in the healthcare environment [180], which is a large-scale problem as well: it has been estimated that 8.9 million cases of healthcare-associated infections occur each year in hospitals and long-term care facilities in the EU/EEA [181]. More than half of these cases are considered preventable [181], with better hand hygiene being one of the primary means of prevention.
To further emphasise the scale of the problems caused by poor hand hygiene, I would like to note that the above estimates predominantly date back to the period before the COVID-19 pandemic, and although it has eventually been established that the SARS-CoV-2 virus is mainly transmitted by air rather than via contaminated surfaces [182], the global health crisis caused by the pandemic has made the issue of proper hand hygiene even more pressing. As better compliance with hand hygiene recommendations reduces the prevalence of healthcare-associated infections [183], it is crucially important to follow them. The most well-known and widely adopted hand-washing technique is provided in the WHO guidelines [184]; it comprises the six key hand-washing movements (see Figure 2.1) that should be performed for an overall duration of 40 to 60 seconds. Although this technique is often promoted in educational campaigns [185] and is arguably easy to follow, epidemiological research indicates that both the general public and medical professionals tend to neglect it [186, 187, 188], thus greatly increasing the risk of spreading infections. While healthcare workers are usually well aware that they should maintain high standards of hand hygiene, they may – and, as observational data indicate, typically do – wash their hands for too short a period of time, omit some of the movements mentioned in the guidelines, or even forget to wash their hands altogether. That often happens due to the notoriously heavy workload and time pressure in healthcare; some other likely reasons for non-compliance with the guidelines are lack of knowledge, hand skin irritation, and poor accessibility of disinfectants [184]. All in all, the data on compliance with the requirements for hand hygiene are quite alarming, as research indicates that healthcare workers wash their hands properly in only about 40% of cases [187, 188], which can hardly be considered satisfactory.
To give a reference point, the goal of some educational campaigns was to attain at least 90% compliance with the WHO guidelines [189]. To improve hand hygiene, it is recommended to use a comprehensive approach [190, 186, 191], which includes organising awareness-raising campaigns, putting up reminders in workplaces, ensuring that hand sanitisers are easily available, carrying out regular audits, and assessing and providing feedback on how well hands are washed. However, it should be noted that interventions reported in the literature have not always produced sustainable results [192, 193], and improving hand hygiene compliance in healthcare settings still remains a challenge. One of the ways to address the issue of the quality of hand washing is to monitor it, with the purpose of both observing the level of compliance with the WHO protocol and improving the hand hygiene habits of medical personnel [194]. Figure 2.1: Steps of washing hands with soap and water according to the WHO guidelines: 0) wet hands with water; 1) apply enough soap to cover all hand surfaces; 2) rub hands palm to palm; 3) right palm over left dorsum with interlaced fingers and vice versa; 4) palm to palm with fingers interlaced; 5) backs of fingers to opposing palms with fingers interlocked; 6) rotational rubbing of left thumb clasped in right palm and vice versa; 7) rotational rubbing, backwards and forwards, with clasped fingers of right hand in left palm and vice versa; 8) rinse hands with water; 9) dry hands thoroughly with a single-use towel; 10) use towel to turn off faucet; 11) your hands are now safe. Reproduced from [184]. That can be done by such means as tracking hand disinfectant consumption [195], direct real-time observation, or filming and subsequently assessing the videos of how medical personnel wash their hands.
Currently, the established standard of monitoring is the direct observation approach [194, 196, 183]; although it offers multiple advantages, it also suffers from certain shortcomings: it is time- and resource-consuming, the observer may fail to notice important details, and the whole direct observation paradigm is potentially subject to the Hawthorne Effect, whereby a person often changes their behaviour when they know that they are being observed but may revert to their previous behaviour once they know that the observations are over [194, 196]. To address these shortcomings, hand-washing monitoring can be automated, i.e., a human observer can be replaced with a system functioning on its own. Such a system would allow the collection of a larger amount of observational data with better precision [197]; even more importantly, it would be capable of functioning continuously around the clock. While it is possible to use data modalities such as accelerometer data for monitoring hand-washing movements, a more prevalent approach (see Section 2.2) is to use video capture from a camera installed above a sink. The main objectives of a monitoring system based on video stream processing are to capture visual information (real-time video) of hand-washing episodes and to recognise and classify hand-washing movements; furthermore, it would be helpful to enable the system to provide feedback to the user during each hand-washing episode as well as to store the data (either recordings or their summaries) of all hand-washing episodes for further analysis [175].
To function successfully, such a system should be capable of: • recognising and classifying hand-washing episodes with sufficient accuracy; • tackling the domain shift problem, that is, performing well in new locations and with new users; • preserving the privacy of the users: for instance, it should not transfer the data in such a way that they could be intercepted, as video footage can contain potentially sensitive information; • running on devices with low power consumption, such as edge or mobile devices, as that would facilitate the installation and practical usage of the monitoring system in a real-world scenario. A promising approach to implementing a system with all the features listed above is to use a smartphone or an edge device as its central unit. The cameras of modern smartphones allow for capturing video at rather high resolutions and frame rates; furthermore, their screens and speakers make it possible to provide different kinds of audiovisual feedback, while their haptic actuators can be used to provide tactile feedback by means of vibration. As for edge devices, they have the advantages of being comparatively inexpensive and more portable, and therefore easier to install, than traditional computers. Regarding the software implementation of a video-based hand-washing monitoring system, its most challenging part is the algorithm for recognising and classifying hand-washing movements. As state-of-the-art results in gesture and movement classification have been achieved with CNNs [198], it appears promising to use these models for the purpose of classifying hand-washing movements. Therefore, the goal of the research described in this chapter was to develop a CNN-based hand-washing movement classifier for a hand-washing monitoring system.
The main hypothesis was that lightweight CNNs, i.e., CNNs capable of running on mobile and edge devices, can successfully – that is, with accuracy above that of a putative ‘naive’ classifier – classify hand-washing movements in a real-world hospital setting. The rest of the present chapter is organised as follows. In Section 2.2, I give an overview of related work, discussing methods for monitoring hand washing. In Section 2.3, I describe datasets containing hand-washing recordings: first, I characterise some such datasets that are publicly available, then I proceed with the descriptions of two datasets – PSKUS and METC – that were acquired and processed by the team that I was a part of. In Section 2.4, I report the initial experiments on these two datasets, and in Section 2.5 – the experiments in a cross-dataset study, which comprised not only these two datasets but also the Kaggle dataset [199]. Finally, in Section 2.6, I finish this chapter with some concluding remarks. 2.2 Related work: methods for monitoring hand-washing I begin a brief overview of related work by highlighting the key points of the surveys on digital methods for monitoring hand hygiene. Several such surveys that are worth noting are the works by Ward et al. [200], Srigley et al. [197], and a more recent publication by Wang et al. [201]. Of the 42 scientific articles surveyed by Ward et al. [200], fewer than 20% included precise estimates of the efficiency or accuracy of the presented systems, which indicates that this crucial aspect of automated hand-washing monitoring systems was insufficiently explored at that time. In general, the surveyed systems typically evaluated the quality of hand washing with a simple binary metric such as done/not done or used a single timer to detect its total duration, yet neither approach is sufficient to ensure that all palmar surfaces have been properly cleaned [200].
Furthermore, although some of the systems surveyed in that study were classified as fully autonomous, they also included a wearable and mobile component, which could impede their use in a real-world scenario. Ward et al. [200] also identified issues with cost, patient privacy, and lack of validation of the designed systems. Srigley et al. [202] provided a systematic efficacy review of hand hygiene monitoring. Most of the studies they surveyed only monitored general compliance with hand hygiene guidelines; only one study also monitored the duration of hand-washing movements. In terms of feedback, only two systems provided individualised feedback and real-time reminders. All in all, Srigley et al. [202] considered the evidence for clinical adoption of electronic and video-based hand hygiene monitoring systems insufficient. It is also worth mentioning that, for future work, they suggested focusing on video-based monitoring approaches. Wang et al. [201] conducted a comprehensive bibliographic search and identified 89 studies of interest. In 73 of these studies, electronic systems for monitoring hand hygiene were used, with the following features: • application-assisted direct observation: 5 out of 73, or 7%; • camera-assisted observation: 10 out of 73, or 14%; • sensor-assisted observation: 29 out of 73, or 40%; • real-time locating system: 32 out of 73, or 44%. Furthermore, in 21 of the surveyed studies, some type of evaluation of hand hygiene quality was carried out: compliance with the WHO protocol was evaluated in 14 studies (67%), and evaluation by means of applying fluorescent substances to the hand surfaces and observing changes in their illumination, which reflects the quality of hand washing, was done in the remaining 7 studies (33%).
The authors identified such limitations of electronic hand hygiene monitoring systems as insufficient accuracy, privacy, confidentiality, and usability; in addition, they observed that there was a lack of standardised metrics to evaluate performance across such systems. Before discussing computer vision-based approaches, I would like to note that the surveys mentioned above indicated that it is possible to monitor hand washing using data acquired from non-visual sensors. Some examples of that are found in Wang et al. [203] and Galluzzi et al. [204, 205]. Wang et al. [203] used armband sensor data and achieved high recognition accuracy for the different movements defined in the WHO guidelines, especially for the user-dependent model, namely, 96% vs. 82% for the user-independent model. Galluzzi et al. [204, 205] reported an accuracy of 93% using consumer-grade wrist-worn accelerometers when recognising the movements defined by the WHO. It is worth noting that in their experiments, the devices were worn on both wrists, likely making the setup rather cumbersome. All in all, approaches based on wearable sensors have several drawbacks: first, they require medical staff to put in extra effort (e.g., to wear a personal wristband device), which can arguably reduce compliance; second, wearable devices can interfere with the hand-washing procedure itself. Therefore, it appears reasonable to concur with the already mentioned viewpoint of Srigley et al. [202] and consider computer vision-based systems to hold greater promise for the design of hand-washing monitoring systems. One of the earliest approaches to monitoring hand washing based on computer vision was described by Hoey et al. [206], who focused on developing an assistant to aid dementia patients with hand washing.
They implemented a particle filter-based classification approach and provided real-time feedback to the user; however, they did not try to distinguish between the different types of hand-washing movements. Llorca et al. [207] presented another vision-based system with the explicit goal of providing automatic hand-washing quality assessment and recognising six different washing ‘poses’. As that study predates the era of deep learning, it employed a classical machine learning approach with a complex pipeline involving skin colour detection, hand segmentation, a particle filtering model for hand tracking, and an SVM-based classifier for movement recognition. Although the reported accuracy was high (70.1% to 97.8%, depending on the motion) on a dataset with 4 test subjects, it is doubtful whether the approach would generalise if more test subjects were included, especially ones with a darker skin colour, or if it were applied to videos taken in real-world conditions. Since DNNs have achieved good results on many perceptual tasks, there have been a number of studies using them for hand-washing movement classification according to the six-step WHO guidelines. Thus, Yeung et al. [208] presented a computer vision and deep learning-based system for hand hygiene monitoring. The system used a depth sensor modality instead of full video data to preserve privacy, and the data were classified using a CNN. However, the authors aimed to recognise adherence to hand hygiene in general rather than to evaluate whether the user performs all the specific hand movements. The study by Li et al. [209] investigated the application of CNNs to gesture recognition in general and achieved high accuracy, showing that this approach is suitable for the task.
Prakasa and Sugiarto [210] extracted frames from videos, converted the resulting images from RGB channels to hue, saturation, and value (HSV) channels to obtain an image component with high contrast in the human skin region (the hue channel), and classified them using a custom CNN classifier. However, the dataset they used consisted of a single instructional video; therefore, the capacity of the model to generalise was not adequately tested. Nagaraj et al. [211] designed a three-stream network architecture based on classical works on two-stream CNN fusion [212, 213]. The three streams utilised RGB frames, optical flow frames, and histograms of gradients as the inputs, thus incorporating spatial, temporal, and object-level information from the videos. The authors used the full Kaggle dataset [199] to show that their approach performed better than any of the three modalities used separately, and achieved 86.6% accuracy. Remarkably, such accuracy was achieved for 12-class separation1 and would likely be even higher if a smaller number of classes were considered. Furthermore, the authors also provided a GitHub repository with an implementation of the fusion classifier. Another recent study by Cikel et al. [214] used the publicly available subset of the Kaggle dataset [199] with hand-washing movements to train and evaluate three models consisting of a ResNet-152 [76] CNN encoder and a decoder based on a 3-layer long short-term memory (LSTM; [215]) network, using as input the RGB frames of the videos for the first one, the optical flow for the second one, and a two-stream input made up of both RGB frames and optical flow for the third one. The RGB network achieved an accuracy of 97.33%. Yamamoto et al. [216] used vision-based systems and a CNN for a different problem, namely, estimating how well hands were washed.
They compared the quality score of the automated system with the ground-truth data obtained by applying to the hands a substance that was fluorescent under UV light. Their results demonstrated that CNNs are able to classify hand-washing quality with high accuracy. 1The 12-class problem arises when left-hand and right-hand washing movements are treated as separate classes. All in all, as follows from the above brief survey of the literature, there have been a number of studies reporting high hand-washing movement recognition accuracy using CNN classifiers based on pretrained models, which are often extended by implementing a multi-stream network architecture or by incorporating recurrent elements of DNN architecture such as LSTM. At first glance, these results seem to suggest that the problem of automated monitoring of hand washing is on the verge of being solved, if not already resolved, yet they should be taken with some caution, since the surveyed studies typically utilised datasets collected in lab environments, such as the Kaggle challenge dataset [199], and therefore it is far from certain whether their findings translate well into real-world scenarios. 2.3 Hand-washing recording datasets Only a few datasets featuring recordings of hand washing were publicly available at the time when the research reported in this chapter was conducted. Thus, the Kinetics Human Action Video Dataset [217] by Google contains 916 videos of washing hands, whereas the STAIR Actions dataset [218], which consists of more than 100 000 videos, contains around 1 000 videos related to washing hands. The Kaggle data science website2 hosts the already mentioned Kaggle Hand Wash Dataset [199], which consists of 292 hand-washing episodes labelled according to the WHO guidelines; however, only 25 of these episodes are publicly available.
The episodes were recorded at a resolution of 720 × 480 pixels at a frame rate of 30 frames per second (FPS); the recordings were made in lab conditions with good lighting and meticulously correct movements. Altogether, these three datasets have several substantial limitations: none of them contains more than 1 000 hand-washing videos, none features medical professionals or a clinical setting, and only the Kaggle dataset, the publicly available part of which is rather small, comes with labelling according to the WHO guidelines. Therefore, the public availability of data for training ML-driven classifiers of hand-washing movements is rather limited. To address the lack of data for training and evaluating hand-washing movement classifiers, a group of Latvian epidemiologists and their support team collected and annotated two datasets of hand-washing videos in a clinical setting: the PSKUS dataset and the METC dataset. To make the datasets suitable for ML tasks, the data collection and annotation were done in cooperation with the research team at EDI, of which I was a part. Afterwards, I conducted classification experiments by training and evaluating CNN models on these datasets. Therefore, I describe both datasets in the following.

2.3.1 PSKUS dataset

Data acquisition

The PSKUS dataset was collected in the summer of 2020 at one of the largest hospitals in Latvia, Pauls Stradins Clinical University Hospital (Paula Stradiņa klīniskā universitātes slimnīca (PSKUS) in Latvian, hence the name of the dataset), using a custom Internet-of-Things (IoT) system (Figure 2.2).

Figure 2.2: Data acquisition setup: (a) prior to deployment; (b) deployed in hospital.

Each instance of the system consisted of one or several
AirLive IP POE 100CAM cameras or Axis M3046V IP cameras installed above hand-washing sinks and connected to a Netgear GS305P 5-port PoE Gigabit Ethernet switch, and a Raspberry Pi 4 device with a microSD card that stored the Raspberry Pi operating system, a custom data acquisition program, and the acquired video files. Such IoT systems were deployed in nine different locations simultaneously, with one location corresponding to one sink. In total, there were 12 cameras, as some of the Raspberry Pi devices had more than one camera attached to them, which made it possible to record hand washing at a single sink simultaneously from different angles. The cameras were deployed in the neurology unit, the surgery unit, an intensive care unit, and other units of Pauls Stradins Clinical University Hospital. The cameras recorded all continuous movements within their field of view; the video stream was captured at a frame rate of 30 FPS and a resolution of either 640 × 480 or 320 × 240 pixels. To filter out irrelevant movements of short duration, e.g., a person passing by, a recording was only started when motion had been detected continuously for 3 seconds; as a result, the videos in the dataset may miss up to the first 3 seconds of each hand-washing episode. Furthermore, to reduce the number of false-positive motion detections, videos shorter than 20 seconds were not saved by the recording system. The recorded data were manually transferred to a central server at Pauls Stradins Clinical University Hospital by taking the microSD cards out of the Raspberry Pi devices and uploading the data from them onto the server.

Overview and structure

The PSKUS dataset consists of video files along with their annotations in CSV and JSON formats. Table 2.1 presents an overview of the dataset.
The folders and files in the dataset are structured as follows:

DataSets
- Dataset1
  - Videos
    - 2020-06-27_11-57-25_camera104.mp4
    - 2020-06-28_18-28-10_camera102.mp4
    - ...
  - Annotations
    - Annotator1
      - 2020-06-27_11-57-25_camera104.csv
      - 2020-06-27_11-57-25_camera104.json
      - ...
    - Annotator2
      - 2020-06-27_11-57-25_camera104.csv
      - 2020-06-27_11-57-25_camera104.json
      - ...
- Dataset2
  - Videos
    - ...
  - Annotations
    - ...
- ...
- summary.csv
- statistics.csv

Each video file has one or more annotations created by one or several annotators. For the sake of convenience, annotations are provided in two formats, although most of the information in the CSV and JSON files overlaps. Video files and their annotations have matching names: thus, the video file name.mp4 has annotations in both name.csv and name.json. Additionally, there are several files providing overview information: the file summary.csv contains a summary of the dataset, and the file statistics.csv contains the key metrics for each hand-washing episode in the dataset.

Table 2.1: Overview of the PSKUS dataset.

Property                                  Value
Frame rate (FPS)                          30
Resolution                                320 × 240 or 640 × 480
Number of videos                          3 185
Number of annotations                     6 690
Total washing duration (seconds)          83 804
Movement 1–7 duration (seconds)           27 517
Episodes with ring present                440
Episodes with armband or watch present    127
Episodes with long nails present          58

Labelling

Annotators – infectious disease experts involved in the research project, other medical professionals, and volunteers, including Riga Stradins University students – were given access to the video files on the server to label the data. At the preliminary stage of analysis, the annotators vetted the videos to remove the files that did not include actual hand-washing episodes. Each annotator did that independently, based on the guidelines provided to them; as a result, some files have been vetted out by one annotator, but not by the other one(s).
Therefore, in the final dataset, some videos may have annotations in one folder (e.g., Annotator1) but not in another (e.g., Annotator2). Furthermore, the annotators are anonymised in the final version of the dataset, and the folder Annotator1 in one part of the dataset is not necessarily annotated by the same person as the folder Annotator1 in a different part of the PSKUS dataset. Afterwards, the annotators labelled hand-washing movements in the videos using a custom annotation program developed in the Python programming language with the OpenCV computer vision library, which allowed the annotator to assign the following information to each video frame: (1) whether hand washing was visible in the frame, and (2) which (if any) of the movements defined in the WHO guidelines the washing corresponded to. The annotation guidelines were developed by the epidemiologists involved in data collection and acquisition on the basis of the WHO guidelines. In total, seven different hand-washing movements were defined, as recommended by the WHO; with the addition of the 'other' movement, the list of movement labels is given in Table 2.2.

Table 2.2: Movement codes for hand-washing movements.

Movement code    Movement description
1                Palm to palm
2                Palm over dorsum, fingers interlaced
3                Palm to palm, fingers interlaced
4                Backs of fingers to opposing palm, fingers interlocked
5                Rotational rubbing of the thumb
6                Fingertips to palm
7                Turning off the tap with a paper towel
0                Other hand-washing movement

According to the annotation guidelines provided for the annotators:
• Codes 1 to 6 are used to denote a correctly performed hand-washing movement that corresponds to one of the movements from the WHO guidelines.
• Code 0 is used to denote either a movement from the WHO guidelines that is not performed correctly, or any washing movement that is not defined in the WHO guidelines.
• Code 7 is used to denote the process of correctly terminating the hand-washing episode.
Specifically, for movement 7 to be recorded, the tap has to be turned off in the following way: the person washing their hands has to take the towel, dry their hands with it, and then turn off the tap with the towel rather than with a bare hand, thus preventing potential contamination from the tap handle.
• Frames that do not capture hand washing are labelled accordingly, i.e., with is washing set to zero.
Additional annotations, applied to the whole video rather than to separate frames, indicate whether the person washing their hands is wearing a ring, a watch, or an armband, or has long artificial nails, as these items may interfere with the quality of hand washing [219, 184, 220]. These annotations are as follows:
• A - armband or watch.
• R - ring.
• N - long nails.
To increase the reliability of the annotations, the majority of files in the dataset are labelled by more than one annotator (Figure 2.3). In particular, out of 3 185 videos, 882 were labelled by one annotator, 1 691 by two annotators, 355 by three annotators, 141 by four annotators, 4 by five annotators, 8 by six annotators, and 104 by seven annotators.

Figure 2.3: Number of annotators per video in the PSKUS dataset.

Some statistics on the inter-annotator agreement are as follows: frames that were annotated by two annotators have 91.23% agreement on the is washing label, and for those frames where both annotators have assigned a value of 1 to is washing, there is further agreement of 90.06% on the movement code. As for the cases of inter-annotator disagreement, the following reasons were identified in the dataset:
• short-term disagreement between the labels typically exists at time points when washing movements change;
• movement 1 and movement 3 look quite similar and can be hard to distinguish when filmed at an angle;
• the interpretation of what constitutes movement 7 has differed between annotators;
• some of the videos have low light levels.
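As an illustration of how such frame-level agreement statistics, and the majority-vote filtering used later in this chapter, can be computed, consider the following sketch (the helper functions and toy label lists are my own simplifications, not part of the dataset tooling):

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Share (in %) of frames on which two annotators assigned the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

def consensus_frames(per_annotator_labels):
    """Keep only frames on which two or more annotators agree;
    returns (frame_index, majority_label) pairs."""
    kept = []
    for i, frame_labels in enumerate(zip(*per_annotator_labels)):
        label, count = Counter(frame_labels).most_common(1)[0]
        if count >= 2:
            kept.append((i, label))
    return kept

# Toy example: three annotators labelling the same five frames.
ann1 = [0, 1, 1, 3, 7]
ann2 = [0, 1, 2, 3, 0]
ann3 = [0, 1, 1, 2, 7]
print(percent_agreement(ann1, ann2))         # 60.0
print(consensus_frames([ann1, ann2, ann3]))  # [(0, 0), (1, 1), (2, 1), (3, 3), (4, 7)]
```

Frames without a majority label are simply dropped, which is the behaviour described for the experiments below.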
Data distribution

After splitting the videos into frames and keeping only those where two or more annotators assigned the same label, the data in the PSKUS dataset have the following distribution by classes: class 0 - 922 724 frames, class 1 - 61 844 frames, class 2 - 93 924 frames, class 3 - 41 489 frames, class 4 - 53 188 frames, class 5 - 50 591 frames, class 6 - 51 112 frames (Figure 2.4). Since class 0 is heavily overrepresented, only 20% of its frames, i.e., 207 732 frames, were used for training the models (see Sections 2.4 and 2.5).

Figure 2.4: Distribution of the frames in the PSKUS dataset by movement classes. The dotted line indicates that only 20% of the frames from class 0 were used in the experiments.

2.3.2 METC dataset

As a key part of the study by Zemlanuhina et al. [175], another dataset of hand-washing videos, the METC dataset, was acquired and labelled. The primary objective of that study was to determine which type of real-time feedback most efficiently motivates medical staff to follow the WHO guidelines for washing hands; while this objective is essential for the development of an efficient hand-washing monitoring system, it lies beyond the scope of this thesis, and I therefore report only the relevant parts of the research in Zemlanuhina et al. [175] in the present section.

Data acquisition

The METC dataset was collected during a user feedback evaluation study [175] in July and August 2021. The study was conducted at the Medical Education Technology Centre (METC; hence the name of the dataset) of Riga Stradins University and involved 72 participants. All participants were healthcare specialists: employees of Riga Stradins University, physicians, and medical students; therefore, they were familiar with the requirements for maintaining proper hand hygiene in a clinical setting.
Each participant took part in one hand-washing session, in the course of which they performed three hand-washing trials, receiving a different type of feedback on how they washed their hands each time, namely:
• In the first trial, there was no guidance for participants on how to wash their hands, i.e., they washed their hands using any sequence of hand-washing movements they considered appropriate, for as long a period of time as they considered sufficient.
• In the second trial, participants washed their hands in a semi-guided mode, that is, they were assisted by a custom-made smartphone application that provided real-time feedback indicating which hand-washing movements were recognised and displayed visual information when a specific movement was performed for more than 7 seconds, i.e., for a sufficient length of time. However, the application neither guided the participants nor informed them that the procedure should be performed for a certain period of time.
• In the third trial, participants washed their hands in a guided mode, that is, the application explicitly instructed them which movements to perform and for how long.
Before each hand-washing session, participants were familiarised with the functionality of the application and the feedback (if any) it would provide in each of the trials. Afterwards, participants treated their hands with a UV-active gel covering the surface of the palm, the dorsum of the hand, the regions between the fingers, and the fingertips; subsequently, images of the treated hands were taken under a UV lamp, revealing all contaminated regions of the hand. After each washing trial, new images of the hands were taken under UV light. As a result, by comparing the images of the hands before and after a washing trial, it was possible to evaluate the quality of hand washing. Videos of the hand-washing sessions were recorded for further use; notably, all experiments took place in the same location, i.e., over the same sink.
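The before/after comparison naturally yields a simple quality score: the fraction of initially contaminated pixels that are no longer fluorescent after washing. The sketch below is purely illustrative and assumes that binary masks of the fluorescent regions have already been extracted from the UV images (thresholding and image registration are omitted); it is not the procedure used in [175].

```python
import numpy as np

def washing_quality(before_mask: np.ndarray, after_mask: np.ndarray) -> float:
    """Fraction of initially fluorescent pixels removed by washing.

    Both arguments are boolean masks of the same shape, True where the
    UV-active gel is visible, i.e., where the skin is still contaminated.
    """
    contaminated_before = before_mask.sum()
    if contaminated_before == 0:
        return 1.0  # nothing to wash off
    still_contaminated = np.logical_and(before_mask, after_mask).sum()
    return 1.0 - still_contaminated / contaminated_before

# Toy 4x4 example: 8 contaminated pixels before washing, 2 remain after.
before = np.zeros((4, 4), dtype=bool); before[:2, :] = True
after = np.zeros((4, 4), dtype=bool); after[0, :2] = True
print(washing_quality(before, after))  # 0.75
```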
As the use of the CNN classifier in the hand-washing monitoring system was still under development at the time when the study was conducted, hand-washing movement recognition was performed by a human operator, an expert in hand hygiene, rather than by the automated system. The human operator monitored how participants washed their hands in real time; the monitoring was done on a PC located in a nearby room and relied on a live video stream from the camera of the monitoring setup. The video streaming was facilitated by a PC application based on a Flask web server, which was also used to annotate the videos.

Overview and structure

The METC dataset consists of video files along with their annotations in JSON format. Table 2.3 presents an overview of the dataset. The folders and files in the dataset are structured as follows:

DataSets
- Interface_number_1
  - Annotations
    - 2021-06-30_09-56-19-11labots.json
    - 2021-06-30_10-28-33-2-1+labots.json
    - ...
  - Videos
    - 2021-06-30_09-56-19-11labots.mp4
    - 2021-06-30_10-28-33-2-1+labots.mp4
    - ...
- Interface_number_2
  - Annotations
    - ...
  - Videos
    - ...
- summary.csv
- statistics.csv

Each video file has an annotation, provided in a JSON file. Video files and their annotations have matching names: thus, the video file name.mp4 has annotations in name.json. Additionally, there are several files providing overview information: the file summary.csv contains a summary of the dataset, and the file statistics.csv contains the key metrics for each hand-washing episode in the dataset.

Table 2.3: Overview of the METC dataset.
Property                                  Value
Frame rate (FPS)                          ≈ 16
Resolution                                640 × 480
Number of videos                          212
Number of annotations                     212
Total washing duration (seconds)          13 870
Movement 1–7 duration (seconds)           9 144
Episodes with ring present                0
Episodes with armband or watch present    0
Episodes with long nails present          0

Note that the frame rate of the videos was slightly variable, as they were reconstructed from sequences of JPG images taken at the maximum frame rate supported by the capturing devices.

Labelling

The ground-truth annotation of the videos was done in real time by a human operator, an expert in hand hygiene. The annotation was carried out as follows: when the operator recognised a hand-washing movement defined in the WHO guidelines, he would mark it accordingly. In general, labelling was done in the same manner as for the PSKUS dataset. In addition to that, the operator also simultaneously evaluated the quality of each recognised movement: the operator marked the movement with OK if the movement was performed correctly or with ! if there were any inaccuracies in performing the movement3. With respect to the quality of hand washing in the dataset, it should be noted that while the participants in the study were knowledgeable medical staff and were instructed to complete the task of washing hands to the best of their ability, imperfect execution of the protocol was still present: some of the participants did not even complete all six basic hand-washing movements, and none of the 72 participants performed movement 7 properly.

Data distribution

The data in the METC dataset have the following distribution by classes: class 0 - 64 980 frames, class 1 - 19 967 frames, class 2 - 22 651 frames, class 3 - 20 221 frames, class 4 - 17 576 frames, class 5 - 19 915 frames, class 6 - 24 682 frames (see Figure 2.5). Since class 0 is overrepresented, only 50% of class 0 frames, i.e., 32 490 frames, were used for training the models (see Sections 2.4 and 2.5).

3 This type of labelling, i.e., OK vs.
!, is not provided in the publicly available version of the METC dataset, but its summary and analysis are provided in [175].

Figure 2.5: Distribution of the frames in the METC dataset by movement classes. The dotted line indicates that only 50% of the frames from class 0 were used in the experiments.

2.4 Initial experiments on PSKUS and METC datasets

To obtain initial classification results on the PSKUS and METC datasets, I conducted experiments by training and evaluating MobileNetV2 [105] CNN models on them. I describe the methodology and results of these experiments in the following. To preprocess the PSKUS dataset for the experiments, I first split the data into trainval and test subsets. The data for the test subset were images from one particular location, the emergency ward, and the rest of the data were used as the trainval subset. This split ensured that the data for training and for evaluating the model did not come from the same location. Afterwards, during model training, the trainval subset was randomly split into training and validation parts at an 80/20 ratio. As the METC dataset was acquired in a single location, preprocessing it for the experiments was more straightforward: first, I divided the dataset into trainval and test subsets at an 80/20 ratio, and then, during the training of the models, the trainval subset was further randomly split into training and validation parts at an 80/20 ratio. The training was done on a PC with an Intel Core i7-12700K CPU, an NVIDIA RTX 3090 GPU, and Windows 11 OS, using TensorFlow v2.16.1 and Keras v3.0.5. I conducted two experiments on each dataset.
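The two splitting schemes described above, holding out one location for the PSKUS test subset and random 80/20 splits for METC, can be sketched as follows (the video-record structure here is a hypothetical simplification):

```python
import random

def split_by_location(videos, test_location):
    """PSKUS-style split: all videos from one location form the test set."""
    trainval = [v for v in videos if v["location"] != test_location]
    test = [v for v in videos if v["location"] == test_location]
    return trainval, test

def random_split(items, ratio=0.8, seed=42):
    """METC-style split: shuffle and cut at the given ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * ratio)
    return items[:cut], items[cut:]

# Toy video records with hypothetical location names.
videos = [{"name": f"v{i}", "location": loc}
          for i, loc in enumerate(["neurology", "surgery", "emergency"] * 4)]
trainval, test = split_by_location(videos, test_location="emergency")
train, val = random_split(trainval, ratio=0.8)
print(len(trainval), len(test), len(train), len(val))  # 8 4 6 2
```

Splitting at the video (or location) level, rather than at the frame level, prevents near-duplicate frames from the same episode from leaking between the splits.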
The architecture of the model was identical across all experiments: the base MobileNetV2 model with weights pretrained on the ImageNet dataset was extended with a preceding data augmentation layer featuring random rotations by up to 20 degrees and horizontal flips, and three follow-up layers:
• GlobalAveragePooling2D layer;
• Dense (i.e., fully connected) layer with 128 neurons, ReLU activation, and a dropout rate of 0.5 to avoid overfitting;
• Dense layer with 7 neurons and softmax activation function.
In the first experiment, I froze the base MobileNetV2 model and only trained the layers added on top of it for 30 epochs with the Adam optimiser (learning rate = 0.001), the categorical cross-entropy loss function, and early stopping enabled so that the model would stop training if there was no validation loss improvement for 10 consecutive epochs. When the training was finished, the best (in terms of accuracy on the validation subset) checkpoint was retrieved and evaluated on the test data. The results of evaluating the model trained on the PSKUS dataset on the test subset of the same dataset were as follows: test accuracy = 55.65%, precision = 24.06%, recall = 17.36%, and F1 score = 16.72%. See also the confusion matrix in Figure 2.6.

Figure 2.6: Results of the initial experiment training MobileNetV2 (top layers only) on the PSKUS dataset: confusion matrix showing the evaluation on the test subset.

Evaluation of the model trained on the METC dataset on the test subset yielded a test accuracy of 50.96%, precision of 53.83%, recall of 49.12%, and an F1 score of 49.74%. The confusion matrix from this experiment is shown in Figure 2.7.

Figure 2.7: Results of the initial experiment training MobileNetV2 (top layers only) on the METC dataset: confusion matrix showing the evaluation on the test subset.

Several observations follow from these results.
First, it is obvious that the F1 score is a more suitable metric than accuracy for a dataset as unbalanced (due to the predominance of class 0) as PSKUS. Second, it appears that the capacity of the models for generalisation on unseen data is starkly different for the PSKUS and METC datasets: while the F1 score of the model trained on the former dataset was just slightly above random guessing, the performance of the model trained on the latter dataset was substantially better. In the second experiment, I trained MobileNetV2 models for 10 epochs with the same hyperparameters as in the first experiment and then continued training for an additional 30 epochs with all model layers unfrozen, i.e., trainable. To avoid model instability after unfreezing, the learning rate was decreased from 0.001 to 0.0001. The results of the experiment with these parameters on the PSKUS dataset were as follows: test accuracy = 56.71%, precision = 35.27%, recall = 16.73%, and F1 score = 15.52%. See also the confusion matrix in Figure 2.8.

Figure 2.8: Results of the initial experiment training MobileNetV2 (the model fully retrained) on the PSKUS dataset: confusion matrix showing the evaluation on the test subset.

A similar experiment on the METC dataset yielded the following results: test accuracy = 63.75%, precision = 68.37%, recall = 62.56%, and F1 score = 63.89%. The confusion matrix from this experiment is shown in Figure 2.9. As follows from the results of the second experiment, the MobileNetV2 model trained on the METC dataset benefited from the additional training with all the layers unfrozen, as its F1 score improved by more than 14 percentage points in comparison with the model trained with only the added layers being trainable. However, for the MobileNetV2 model trained on the PSKUS dataset, the outcome of the same additional training was entirely different, as it led to a decrease in the F1 score by approximately 1 percentage point.
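The two experiments together correspond to the common two-stage transfer-learning recipe: first train only the added top layers on top of a frozen base, then unfreeze everything and continue at a lower learning rate. A minimal Keras sketch consistent with the description above is given below; note that the base model is built without pretrained weights here to keep the sketch self-contained (the experiments used ImageNet weights), and the data pipeline is reduced to comments.

```python
import tensorflow as tf

def build_model(num_classes=7):
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False,
        weights=None)  # the experiments used weights="imagenet"
    model = tf.keras.Sequential([
        # Augmentation: random rotations by up to 20 degrees (20/360 of a
        # full turn) and horizontal flips; active only during training.
        tf.keras.layers.RandomRotation(20 / 360),
        tf.keras.layers.RandomFlip("horizontal"),
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    return model, base

model, base = build_model()

# Stage 1: train only the layers added on top of the frozen base.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30,
#           callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss",
#                                                       patience=10)])

# Stage 2: unfreeze everything and continue at a lower learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30)
```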
Remarkably, it is not the case that the model in question did not learn anything during the training, as its loss and accuracy improved consistently during the training phase. However, the model seemed to be incapable of generalising the learned knowledge when it came to classifying images from new, previously unseen locations, i.e., the data in the test subset. Another likely explanation for the worse performance of the model trained on the PSKUS dataset is that the consistency of the labelling by multiple annotators was lower than that of a single annotator, as was the case for the METC dataset. As a consequence, less consistent labelling made it more difficult for the model to learn how to classify the images.

Figure 2.9: Results of the initial experiment training MobileNetV2 (the model fully retrained) on the METC dataset: confusion matrix showing the evaluation on the test subset.

2.5 Cross-dataset study of CNN performance

As reported in the previous section, I conducted a number of experiments by training CNNs on the two real-world datasets, PSKUS and METC, which were of different sizes and levels of complexity. The MobileNetV2 model trained on the PSKUS dataset demonstrated a performance similar to that of a putative 'naive' classifier on the test data, whereas its counterpart trained on the METC dataset demonstrated substantially better results. However, even the performance of the latter model was still not as impressive as the results reported in the literature (see Section 2.2).
One possible explanation for this discrepancy is that there is a substantial gap between what a CNN classifier can achieve on datasets collected in lab conditions (with ideal lighting, the same camera position and the same sink used across all trials, and emphatically correct, uniform performance of hand-washing movements) on the one hand, and on datasets collected in far-from-ideal real-world conditions (PSKUS) or near-real-world conditions (METC) on the other hand. To investigate whether that was the case, we4 conducted a study [176], which I report in the present section; the goal of the study was to investigate whether the same model architectures that, as reported in the literature, perform well on smaller and simpler datasets would also perform well on larger and more complex datasets. Since the CNN models reported in the literature on the classification of hand-washing movements are often extended with a multi-stream network architecture or recurrent elements such as LSTM, we also utilised such extended architectures in our experiments, in addition to the single-frame classifiers that were used in the experiments reported in the previous section. Furthermore, in this study, we continued using lightweight classifiers such as MobileNets, which can run on mid-range smartphones in inference mode without requiring powerful hardware such as dedicated graphics accelerators, as we considered such an approach to be more suitable for designing real-world hand-washing monitoring systems in the future.

4 In this section of the chapter, I mainly use 'we' rather than 'I', as the experiments reported here were done collaboratively by my colleague Atis Elsts and me.

2.5.1 Datasets and data preprocessing

Figure 2.10: Sample images featuring movement class 1 (rubbing palm to palm) from: (a) the Kaggle hand-washing dataset; (b) the METC dataset; (c) the PSKUS dataset.
In our experiments, we used three datasets: the PSKUS dataset, the METC dataset, and the publicly available part of the Kaggle hand-washing dataset. Sample images from each dataset are shown in Figure 2.10; furthermore, the main statistics of each dataset are summarised in Table 2.4.

Table 2.4: Main characteristics of the datasets used in the cross-dataset study.

Parameter           Kaggle       METC        PSKUS
Washing episodes    25           213         3 185
Users               ≤25          72          many
Locations           ≤25          1           9
Environment         Lab          Lab         Real-life
Resolution          720 × 480    640 × 480   640 × 480, 320 × 240
Frame rate (FPS)    30           ≈ 16        30

To further expand on the differences between the three datasets, the Kaggle dataset features high-quality scripted hand-washing videos corresponding to each of the hand-washing steps defined by the WHO [221]. That makes it markedly different from the METC dataset, where some mistakes in executing hand-washing movements are still present despite the preliminary instructions given to the participants. The difference is even more conspicuous in the case of the PSKUS dataset, which features medical staff washing their hands as part of their normal job duties: the videos in it were filmed in real-life conditions and therefore include hand-washing positions that are partially out of the frame or partially occluded, as well as low and variable lighting conditions. All in all, in the PSKUS dataset, imperfect and incomplete execution of the washing steps is the rule rather than the exception. To make it possible to compare the results across the datasets, the latter were preprocessed accordingly. The original labelling in the Kaggle dataset distinguishes between washing the left hand and the right hand; since there is no such distinction in the two other datasets, the respective classes were merged so that left-hand and right-hand movements would belong to the same class.
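As a sketch of this relabelling step (the side-suffixed label strings below are hypothetical placeholders; the actual Kaggle label names may differ):

```python
def merge_left_right(label: str) -> str:
    """Map side-specific labels onto side-agnostic classes, e.g. the
    hypothetical labels 'step_2_left' and 'step_2_right' both become
    'step_2'; labels without a side suffix are returned unchanged."""
    for suffix in ("_left", "_right"):
        if label.endswith(suffix):
            return label[: -len(suffix)]
    return label

labels = ["step_2_left", "step_2_right", "step_1", "step_5_left"]
print([merge_left_right(l) for l in labels])
# ['step_2', 'step_2', 'step_1', 'step_5']
```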
Furthermore, as the wrist-washing movement (step 7, according to the WHO guidelines) is not labelled in the other two datasets, the respective class was merged with class 0, which corresponds to the 'other' movement. As the videos in the METC dataset were annotated by a human operator in real time, there was some delay between the actions in the videos and the operator's response. To eliminate it, a 1-second-long video segment was removed every time the class label changed in the data stream. Finally, as for most videos in the PSKUS dataset there were annotations available from multiple annotators, the same approach as in the experiments reported in Section 2.4 was used: only those parts of the dataset for which two or more annotators had assigned matching class labels were used for training and evaluating the CNN models.

2.5.2 Experiments: methodology and results

Model architectures

The baseline model (see Figure 2.11 (a)) used in the experiments was MobileNetV2 [105] with weights pretrained on ImageNet [19] and the following architecture:
• input layer;
• data augmentation layer featuring random rotations by up to 20 degrees and horizontal flips;
• preprocessing layer;
• baseline MobileNetV2 model;
• Flatten layer;
• included only in the model with extra layers: GlobalAveragePooling2D layer and two Dense layers, each with 128 neurons, ReLU activation function, and a dropout rate of 0.2 to avoid overfitting;
• Dense layer with 7 neurons and softmax activation function.
In addition to the baseline model, two more complex types of architectures were used in the experiments. The first was a two-stream network (Figure 2.11 (b)) consisting of two MobileNetV2 models, one of which processed the RGB input, while the other processed the optical flow input. The goal of adding the latter type of input to the model was to represent motion between consecutive frames and thus improve the capacity of the model to capture the temporal aspect of the input.
Relevant studies (see, e.g., [211]) suggest that temporal information is required for accurate movement recognition, as it is arguably not possible to differentiate between movement 1 and movement 3 using just a single image.

Figure 2.11: Architectures of the CNN models used in the cross-dataset study: (a) baseline CNN; (b) two-stream CNN; (c) recurrent CNN.

Overall, the architecture of the model was as follows:
• two parallel input layers;
• two parallel data augmentation layers;
• two parallel baseline MobileNetV2 models;
• Concatenate fusion layer;
• Flatten layer;
• included only in the model with extra layers: GlobalAveragePooling2D layer and two Dense layers, each with 128 neurons, ReLU activation function, and a dropout rate of 0.2 to avoid overfitting;
• Dense layer with 7 neurons and softmax activation function.
The second complex type of architecture used in the experiments was a recurrent CNN (Figure 2.11 (c)), with a time-distributed layer joining together five base MobileNetV2 models and a Gated Recurrent Unit (GRU; [222]) used as the memory unit. Similarly to the two-stream network described above, the goal of incorporating the GRU in the architecture was to improve the capacity of the model to classify hand-washing movements by capturing the temporal aspect of hand washing. The overall architecture of the model was as follows:
• five parallel input layers;
• five parallel data augmentation layers;
• five parallel baseline MobileNetV2 models;
• TimeDistributed fusion layer;
• GRU layer with 256 neurons;
• included only in the model with extra layers: two Dense layers, each with 128 neurons, ReLU activation function, and a dropout rate of 0.2 to avoid overfitting;
• Dense layer with 7 neurons and softmax activation function.
The summary of the configuration, hyperparameters, and training procedure (details follow) of the baseline, two-stream, and recurrent CNN models is given in Table 2.5.
Table 2.5: Configuration and hyperparameters of CNNs for the cross-dataset study.

Parameter                          Value

All networks / default values
Base model                         MobileNetV2
Initial weights                    ImageNet, 224 × 224
Input image resolution             320 × 240
Data augmentations                 Rotations, flips
N of fully connected layers        1
Layers retrained                   1 ('top') or all ('full')
Epochs                             20
Batch size                         32
Optimiser                          Adam
Learning rate                      default (0.001)
Loss function                      Cross-entropy
Classes                            7

Two-stream networks
Streams                            RGB & optical flow
Fusion                             Before dense layers
Optical flow type                  Farneback
Optical flow step (seconds)        0.33

Recurrent networks
Recurrent element                  GRU
Frame step (seconds)               0.2
Frames                             5

Baseline and recurrent CNN with extra layers
Additional fully connected layers  2

Training procedure

Prior to training the models, the datasets were split into trainval and test subsets as follows:
• the Kaggle dataset was split at a 70/30 ratio, making sure that frames from a particular video appeared in either the trainval or the test split, but not in both;
• similarly, the METC dataset was split to maintain the same condition, but at a 75/25 ratio;
• the PSKUS dataset was split in the same way as in the study reported in Section 2.4, that is, the data for the test subset were images from one particular location, the emergency ward, and the rest of the data were used as the trainval subset.
Furthermore, since the data from the three datasets had different original resolutions, the images were rescaled to the same resolution of 320 × 240 pixels using the built-in TensorFlow [77] functions with bilinear interpolation as the resize method. Afterwards, the models were trained for 20 epochs each, with early stopping enabled if there was no decrease in the validation loss for 10 epochs; class weighting was used to deal with the class imbalance in the datasets.
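Class weighting, mentioned above, can be computed with the usual inverse-frequency heuristic and passed to Keras via the class_weight argument of fit(); the exact scheme used in the study is not specified, so the following is only one plausible sketch:

```python
def inverse_frequency_class_weights(counts):
    """Weight class c by total / (n_classes * count_c), so that rare
    classes contribute proportionally more to the loss (the same
    heuristic as scikit-learn's class_weight='balanced')."""
    total = sum(counts.values())
    n = len(counts)
    return {cls: total / (n * c) for cls, c in counts.items()}

# Toy frame counts, loosely shaped like a class-imbalanced dataset.
counts = {0: 8000, 1: 1000, 2: 1000}
weights = inverse_frequency_class_weights(counts)
print(weights)  # class 0 gets a weight below 1, the rare classes above 3
# Usage with Keras (sketch): model.fit(..., class_weight=weights)
```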
In addition to the experiments with training and evaluating models on the same dataset, we conducted experiments employing transfer learning to measure generalisation performance across datasets, both before and after 10 epochs of fine-tuning. In these experiments, we investigated only knowledge transfer from less complex to more complex datasets, i.e., from the Kaggle dataset to the METC dataset, from the Kaggle dataset to the PSKUS dataset, and from the METC dataset to the PSKUS dataset.

Results

As two of the datasets used in this study, the METC dataset and, particularly, the PSKUS dataset, are imbalanced in class representation, I report the results as F1 scores rather than accuracy. While I report the evaluation results on both the validation and the test data, these findings are not equally important. The most important results are those achieved on the test split, as they demonstrate the performance of the model on unseen data; the results on the validation split are important insofar as they allow us to understand whether the model learns anything at all, which is a pertinent question when the performance of the model on the test data is unsatisfactory. The results of the main experiments are shown in Figure 2.12. While it was expected that the best results on the test subsets of the respective datasets would be achieved by the more complex models, i.e., either the two-stream network or the recurrent CNN, the best F1 score on the Kaggle and METC datasets was achieved by the baseline model, which only used a single frame as the input: it achieved a 96% F1 score on the Kaggle dataset and a 64% F1 score on the METC dataset.
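The preference for the F1 score over accuracy on imbalanced data can be illustrated with a toy computation (illustrative numbers only, not data from the study): a classifier that always predicts the majority class achieves high accuracy but a poor macro-averaged F1 score.

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: per-class F1 scores averaged with equal class weight."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Imbalanced toy data: 90 frames of class 0, 10 frames of class 1.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # degenerate classifier: always the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = macro_f1(y_true, y_pred, classes=[0, 1])
```

Here the degenerate classifier scores 90% accuracy but only about 47% macro F1, since the minority class contributes an F1 of zero.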
Another notable result is that full retraining usually improved the accuracy of the single-frame models while decreasing that of the more complex classifiers: it improved the performance of the baseline classifier from 93% to 96% on the Kaggle dataset and from 45% to 64% on the METC dataset, whereas it deteriorated the performance of the two-stream model and the recurrent model from 94% to 61% and from 90% to 36%, respectively, on the Kaggle dataset, and from 46% to 38% and from 55% to 33%, respectively, on the METC dataset. A possible explanation is that full retraining caused overfitting in the more complex models. Most concerningly, none of the approaches showed even average performance on the test data of the most complex among the datasets, the PSKUS dataset. Namely, the best classifier, which was the two-stream model with only the top layers retrained, achieved an F1 score of only 21%; remarkably, the poor performance was not caused by a failure to learn, as the F1 score on the validation data was above 95% in some of the experiments (the fully retrained single-frame model and the recurrent CNN); arguably, the problem lies in the poor generalisation ability of the models to the more complex data.

Figure 2.12: F1 scores of different CNN architectures evaluated in the cross-dataset study: (a) on the Kaggle dataset; (b) on the METC dataset; (c) on the PSKUS dataset.

The results of the experiments with the baseline and recurrent CNN classifiers with two extra Dense layers (see Figure 2.13) showed the same pattern as the results in the main experiments: the best-performing type of architecture was still the fully retrained single-frame model, which performed very well (96% F1 score) on the Kaggle dataset and satisfactorily (64% F1 score) on the METC dataset, but showed no improvement in comparison with the same model without extra layers.
While adding extra layers improved the performance of the single-frame model on the PSKUS dataset both for the model with only the top layers trained (+3%) and for the fully retrained model (+5%), the F1 score of 25% achieved by the best among them, the fully retrained single-frame model, was far from satisfactory. (As the two-stream network did not show a major improvement over the two other approaches in any of the experiments, this type of architecture was excluded from further experiments.)

Figure 2.13: F1 scores of the different CNN architectures with two extra Dense layers in the cross-dataset study: (a) on the Kaggle dataset; (b) on the METC dataset; (c) on the PSKUS dataset.

Finally, with respect to the results of the transfer learning experiments (Figure 2.14), it is worth noting that one of the retrained models, the Kaggle-to-METC model, achieved the best performance among all experimental groups on the METC dataset (65% F1 score), while another such model, the Kaggle-to-PSKUS model, achieved the best performance among all experimental groups on the PSKUS dataset (27% F1 score). However, aside from this, the lessons from the knowledge transfer attempts are not encouraging, as none of the classifiers showed acceptable performance before being retrained on the new dataset, and the recurrent CNN initially trained on the Kaggle data completely failed to learn on the PSKUS dataset. Furthermore, after retraining, the performance on the new dataset was, all in all, quite similar to the performance when the classifier was trained straight from the MobileNetV2 base. This suggests that the classification knowledge learned by the models in the experiments reported in this section was not transferable to new data contexts.
To summarise the results, lightweight CNN-based classifiers that performed well on the Kaggle dataset (>95% F1 scores) demonstrated mediocre performance on the more complex lab-based METC dataset (50–60% F1 scores) and failed to generalise to the real-life PSKUS dataset. Adding temporal information typically reduced generalisation performance, likely because the more complex models were more susceptible to overfitting; adding extra Dense layers was helpful only in certain cases, such as training the single-frame model on the PSKUS dataset; and the effect of full retraining depended on the architecture of the model, generally improving the performance of single-frame models and deteriorating that of more complex models. In summary, these results demonstrate that the dataset is in fact more important than the approach when evaluating hand-washing movement classification accuracy, and that translating the existing work on hand-washing movement classification from the lab to real-life conditions is not straightforward at all.

Figure 2.14: F1 scores of the different CNN architectures with transfer learning in the cross-dataset study: (a) from the Kaggle dataset to the METC dataset; (b) from the Kaggle dataset to the PSKUS dataset; (c) from the METC dataset to the PSKUS dataset.

2.6 Concluding remarks

In this chapter, I described the research on the use of CNNs for the classification of hand-washing movements, with the primary goal of creating an automated system for monitoring hand hygiene in a clinical setting. This goal is particularly relevant for healthcare: as poor hand hygiene in hospitals causes numerous infection cases and even deaths, an automated monitoring system would be of substantial help in improving the hand hygiene habits of medical staff.
While there have been several studies aiming at recognising hand-washing movements according to the WHO guidelines, the ML models in these studies were trained on datasets collected in lab conditions. Since there had not been any publicly available large real-world datasets of hand-washing videos with labelling corresponding to the WHO guidelines, the research reported in this chapter started with collecting the data: first, the large real-world PSKUS dataset, and then the METC dataset, which can be positioned between simpler lab-collected datasets such as the Kaggle dataset and more complex real-world datasets. Preliminary experiments on the PSKUS and METC datasets demonstrated that the lightweight CNN classifier MobileNetV2 did not outperform a putative ‘naive’ classifier on the former dataset; its performance on the latter dataset was substantially better, but still much worse than that typically reported in the literature on the use of CNN-based classifiers for hand-washing movement classification. The first conclusion from this is that the impressive classification results that have recently been reported in the literature on hand-washing movement classification should be taken with a grain of salt, as the methods they have been achieved with may not translate well to complex real-world applications. The second conclusion is that the problem of domain shift regarding the datasets for CNNs should be taken seriously, as the research reported in this chapter demonstrated that changing the task from the classification of the video stream from over one sink (the METC dataset case) to the classification of the stream from over several sinks (the PSKUS dataset case) dramatically degrades the performance of classifiers with the same architecture.
Taking into account that the best performance of the CNN-based classifiers on the target dataset, the PSKUS dataset, is far from satisfactory, the third conclusion is that the problem of classification on this real-world dataset has not been solved yet, and further work is needed. Furthermore, it is worth noting that the PSKUS dataset became publicly available in February 2021, and since then it has attracted some attention from the research community working on the problem of hand-washing movement classification: according to the statistics provided by the host website6, it has been viewed 2422 times and downloaded 2410 times. In addition, according to Google Scholar metrics, the article in the MDPI Data journal describing the dataset was cited 12 times7, and the publication [176] on the cross-dataset study was cited 6 times8. However, somewhat surprisingly, the use of the PSKUS dataset in the publications citing it appears to be limited to the Related Work (or equivalent) section; that is to say, the authors of these publications merely acknowledge the existence of the PSKUS dataset but never actually try to solve the classification problem on it. Therefore, designing a classifier with a sufficiently high accuracy on the PSKUS dataset remains a task for the future.

6https://zenodo.org/record/4537209; accessed 16 August 2024
7https://scholar.google.lv/scholar?oi=bibs&hl=en&cites=4959999994190408658; accessed 16 August 2024
8https://scholar.google.lv/scholar?oi=bibs&hl=en&cites=13054653446600211893; accessed 16 August 2024

Chapter 3

Semantic Segmentation of Street Views

In this chapter, I present research on the use of CNNs for semantic segmentation of street views, which is a pivotal task for the design of self-driving cars.
The primary focus of the research I report here is on the promising approach of augmenting a dataset of real-world images with synthetic data, which helps tackle the problem of the limited availability of labelled data for training semantic segmentation models. A large part of the insights and material in this chapter comes from the following publication in a scientific journal:

[223] M. Ivanovs, K. Ozols, A. Dobrajs, and R. Kadikis, “Improving semantic segmentation of urban scenes for self-driving cars with synthetic images,” Sensors, vol. 22, no. 6: 2252, 2022.

As the first author of the above publication, I was responsible for planning and conducting experiments for synthetic data generation as well as for training and validating CNN-based semantic segmentation models. I also took the lead in analysing the experimental results, documenting the findings, and drafting the manuscript. Finally, I played a principal role in revising and finalising the manuscript in collaboration with my co-authors.

3.1 Introduction

There are numerous applications for methods used in the semantic segmentation of images of urban areas, such as estimating building footprints [224], mapping urban green spaces [225], and detecting water bodies [226] and slums [227]. For these purposes, aerial or satellite images are typically used, as they provide a bird’s-eye view of cityscapes. However, when it comes to images of street views – by which I mean panoramic images taken from a position close to the ground, i.e., from the vantage point of a driver or a pedestrian – there appears to be a single major application for semantic segmentation methods, namely, use in the navigation system of self-driving cars. Therefore, to underscore the relevance of the work reported in this chapter, I begin with a brief exploration of the topic of self-driving cars.
Self-driving cars, which are also referred to as robotic cars [228], autonomous vehicles [229, 230], and driverless vehicles [231], are currently one of the most promising emerging technologies and a lively area of academic and industrial research. Research laboratories, universities, and companies have been actively working on designing self-driving cars since the mid-1980s [232]; in the last decade, research on self-driving cars and the development of their prototypes with varied degrees of autonomy have gained momentum, with an increasing focus on technologies for data acquisition and processing [233]. However, the task of developing self-driving cars with the highest level of autonomy – i.e., autonomous to such an extent that no human intervention during driving is required under any circumstances [234] – still remains an unsolved challenge.

Figure 3.1: Schematic overview of the autonomy system of a self-driving car, including Traffic Signalization Detection (TSD) and Moving Objects Tracking (MOT). Reproduced from [232].

The architecture of the autonomy system in self-driving cars is typically divided into two main modules (Figure 3.1): the perception system and the decision-making system [235, 232]. The perception system deals with image understanding tasks such as object recognition, object localisation, and, particularly relevant to the topic of this chapter, semantic segmentation [236]. Similar to other visual domains, the best results in the semantic segmentation of street views have been achieved using CNNs [237]. However, the issue of data availability limits the feasibility of using CNNs for this purpose, as acquiring images in an urban setting is both expensive and time-consuming; in addition, sharing the acquired data publicly may encounter legal obstacles due to privacy concerns.
Furthermore, pixel-wise labelling of complex urban scenes takes a lot of time and effort: it was reported that for the creation of the Cambridge-driving Labeled Video Database (CamVid; [238]), labelling took around 1 hour per image, whereas in the case of the Cityscapes dataset [130], fine pixel-level annotation and quality control of a single image required on average more than 1.5 hours. As outlined in Section 1.4.4, one of the approaches to tackling the problem of data availability for training DNNs is to resort to the use of synthetic data, i.e., artificially generated datasets that are to some extent similar to real-world data. Using synthetic data eliminates, or at least decreases, the need for real-world data acquisition and provides additional advantages. A synthetic image generation pipeline can be modified to produce more diverse data – e.g., by changing the weather conditions or increasing the number of particularly salient objects such as cars, pedestrians, traffic lights, and traffic signs. This allows the production of diverse synthetic data on a large scale. Furthermore, such pipelines usually eliminate the need for manual image annotation, as the contours of both the objects and the background can be obtained automatically and with a high degree of precision. The goal of the work reported in the present chapter was to improve the accuracy of semantic segmentation of street views by augmenting a dataset of real-world images with synthetic data. In particular, I investigated whether it is possible to improve the accuracy of semantic segmentation by using synthetic data generated with the open-source driving simulator CARLA [11], which can be done in a relatively simple, fast, and largely automated manner. The main hypothesis of the study was that augmenting real-world data with synthetic data would result in improved accuracy of semantic segmentation of street views. The rest of the chapter is structured as follows.
In Section 3.2, I give an overview of related work, surveying datasets and methods for semantic segmentation of street views. In Section 3.3, I describe the three datasets of street views that I used for training CNN models in the study. Section 3.5.2 presents the results of the experimental study and their discussion; it is followed by Section 3.6, which offers concluding remarks.

3.2 Related work: datasets and methods for semantic segmentation of street views

Since labelling semantic segmentation datasets is labour-intensive, only a limited number of publicly available semantic segmentation datasets exist, and they tend to be comparatively small in size. The CamVid dataset [238] consists of 700 annotated images obtained from a video sequence of 10 minutes; the pixel-wise labelled subset of the Daimler Urban Segmentation Dataset [239] contains 500 images; the KITTI semantic instance segmentation benchmark dataset [240] consists of 200 semantically annotated training and 200 test images; finally, the Cityscapes dataset [130] contains 5 000 fine-labelled and 20 000 coarse-labelled images. Due to its comparatively large size and the high quality of its annotations and documentation, Cityscapes is the best-known and most widely used dataset of semantically annotated street views. In particular, according to the Cityscapes benchmark suite for pixel-level semantic segmentation1, at the time of writing, there were 294 models listed in the benchmark table, each submitted to the automated benchmarking server. However, it should be noted that the code for many of these models is not publicly available, which makes reproducing their results challenging. Among the models with publicly available code, those from the DeepLab library [123] were particularly noteworthy when the study reported in the present chapter was conducted, as they were made publicly available via the well-documented and popular DeepLab repository2 and demonstrated high accuracy.
Specifically, according to the benchmarks published by the authors of the repository, the Xception-65 model achieved a 78.79% mIoU on the test subset after training on the Cityscapes fine-labelled images3.

1https://www.cityscapes-dataset.com/benchmarks/#instance-level-results; accessed 6 September 2023
2https://github.com/tensorflow/models/tree/master/research/deeplab; accessed 6 September 2023

While the results achieved on the Cityscapes dataset with the state-of-the-art methods for semantic segmentation are rather impressive, there is still room for further improvement. One possible approach is to increase the amount of training data; however, this is difficult due to the scarcity of semantically labelled data for self-driving cars caused by ‘the curse of dataset annotation’ [241]. This is a major problem for training CNN models for semantic segmentation in general, which extends beyond the particular task of improving the accuracy of segmentation on the Cityscapes dataset and can potentially impede the development of self-driving cars with a high degree of autonomy. In response to this problem, a number of studies have used synthetic data for augmenting segmentation datasets for self-driving cars. Ros et al. [242] used synthetic images from their Synthetic collection of Imagery and Annotations (SYNTHIA), which represents a virtual New York City modelled by the authors with the Unity platform, to augment real-world datasets and improve semantic segmentation; Richter et al. [243] extracted synthetic images and data for generating semantic segmentation masks from the video game Grand Theft Auto (GTA) V and used the acquired synthetic dataset for improving the accuracy of semantic segmentation; Hahner et al. [244] created a custom dataset of synthetic images of foggy street scenes to enhance semantic scene understanding under foggy road conditions. However, there are still several challenges in the use of synthetic data for self-driving cars.
First, even high-quality synthetic images are not entirely photorealistic and are therefore less valuable for training than real-world images; consequently, it is often more reasonable to augment a real-world dataset with synthetic data rather than train CNN models solely on synthetic data [9]. Second, generating synthetic data can require considerable effort, at least at the stage of initial design. Developing a pipeline for the generation of synthetic images can be challenging in terms of the time and effort involved; another approach, acquiring synthetic images from video games, may also prove difficult, since the internal workings and assets of games are often hard to access [243]. A promising alternative path for the generation of synthetic images is to utilise open-source sandbox driving simulators such as The Open Racing Car Simulator (TORCS) [245] or Car Learning to Act (CARLA) [11]. While such simulators are said to lack the extensive content found in top-level video games [243], their open-source nature makes it easier to access and modify them, and imposes fewer (if any) legal constraints on the use of the generated data. Recently, Berlincioni et al. [246] made use of data generated with a driving simulator: the authors created their Media Integration and Communication Center – Semantic Road Inpainting (MICC-SRI) dataset with CARLA to tackle the problem of image inpainting [247, 248], i.e., predicting missing or damaged parts of an image by inferring them from the context. However, it remains an open question whether the quality of synthetic images generated with open-source driving simulators is sufficient for improving the accuracy of CNNs on semantic segmentation, as this task demands particularly high-quality training data.
3https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md; accessed 6 September 2023

3.3 Street views datasets for semantic segmentation: Cityscapes, MICC-SRI, and CCM

To investigate the use of synthetic data generated with open-source driving simulators to augment real-world street view datasets, I used three datasets: the Cityscapes dataset [130] of real-world images, the MICC-SRI dataset [246], composed of synthetic images generated with the CARLA simulator, and the CCM (Cityscapes-CARLA Mixed) dataset, which I created for the study reported in this chapter. These three datasets are described in detail below; sample images from each dataset and their corresponding pixel segmentation masks are shown in Figure 3.2. Cityscapes [130] is one of the most well-known datasets of urban landscapes for self-driving cars. It comprises a diverse set of images at a resolution of 1024×2048 pixels taken on the streets of 50 different European (predominantly German, with some Swiss) cities using cameras mounted on a specially equipped car. Coarse semantic segmentation annotations are available for 20 000 images, while fine (pixel-level) annotations are provided for 5 000 images. In the present study, I used only the Cityscapes images with fine annotations. Furthermore, I had to change the division of the original dataset into training, validation, and test subsets: due to the need to relabel the segmentation masks (see Section 3.4) to ensure compatibility between the Cityscapes images and the synthetic images generated with CARLA, it was not possible to benchmark the models used in this study on the original Cityscapes test set, which is withheld from public access to ensure impartial benchmarking. I created a custom split of the Cityscapes dataset as follows: out of the 3 475 publicly available Cityscapes fine-annotated images, I used 2 685 for training, 290 for validation, and 500 for testing CNN models.
I adhered to the original Cityscapes policy that images from a particular location should only appear in one split, i.e., in the training, validation, or test set, but not in multiple splits. This approach ensures strict separation between the data in each subset, which was necessary to avoid the flawed methodology of testing a CNN model on data that is too similar to the data it was trained on. Specifically, after randomly selecting the locations, I used images taken in Frankfurt, Lindau, and Münster for the test dataset, images taken in Bochum, Krefeld, and Ulm for the validation dataset, and the rest of the images for training the models. The MICC-SRI dataset [246] consists of 11 913 synthetic RGB frames of urban driving footage with a resolution of 600×800 pixels generated with the CARLA simulator [11]; for all RGB frames, semantic segmentation annotations are provided. The dataset was originally created for semantic road inpainting tasks, and the images are not photorealistic (cf. Figure 3.2, b). As Berlincioni et al. [246] report, the frames for the MICC-SRI dataset were collected by running separate simulations for 1 000 frames from each spawning point on the two maps available in CARLA version 0.8.2, the most recent version at the time of their study. The simulations were run at 3 FPS; to ensure the diversity of the generated data, the simulations were subsampled to take an image every 3 seconds. Further processing reported by the authors included removing occasional misalignments between the RGB frames and the segmentation masks. The RGB frames and corresponding semantic segmentation annotations in the MICC-SRI dataset are available in two versions: one with static objects only, and one with both static and dynamic (cars and pedestrians) objects. For the study reported here, I used only the RGB images and segmentation masks containing both static and dynamic objects.
My custom-made CCM dataset consists of 2 685 Cityscapes images as well as 46 935 synthetic images that I generated with the CARLA simulator.

Figure 3.2: Sample images and their segmentation masks from: (a) the Cityscapes dataset; (b) the MICC-SRI dataset; (c) the CCM dataset. Note that (a) and (c) are not to scale with respect to (b), as the actual resolution of (a) and (c) is 1024×2048 pixels vs 600×800 pixels for (b).

The resolution of the synthetic images is 1024×2048 pixels, matching the resolution of the Cityscapes images. The synthetic images were collected by running simulations on several maps available in the latest stable release of CARLA (v0.9.12), namely, Town 1, Town 2, Town 3, Town 4, and Town 10. The simulations were run at 1 FPS, with an RGB image and its corresponding segmentation mask saved every second. To increase diversity, the simulation in Town 1 was run with the weather conditions set to Clear Noon, whereas the simulation in Town 10 was run with the weather conditions set to Cloudy Noon; on the rest of the maps, the simulations were run with default settings. To acquire images with a large number of dynamic objects, the simulations were run with 100 spawned vehicles and 200 spawned pedestrians. The number of images acquired in each location is given in Table 3.1. The simulations took approximately 96 hours to complete on a desktop PC with Windows 10 OS, an Intel i5-6400 CPU, and an NVIDIA 1060 GPU.

Table 3.1: The number of CARLA-generated images for the CCM dataset by location.

Location | Images
Town 1 | 7 866
Town 2 | 3 838
Town 3 | 11 124
Town 4 | 11 484
Town 10 | 13 049

3.4 Data preprocessing

Data preprocessing consisted of relabelling semantic segmentation masks, resizing images, and augmenting the real-world Cityscapes data with synthetic data. Relabelling the semantic segmentation masks ensured compatibility between the annotation labels of the Cityscapes images and the CARLA-generated images in both the MICC-SRI and CCM datasets.
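Such relabelling amounts to a per-pixel lookup that maps each source label ID to the ID of the merged class. A minimal sketch is shown below; the numeric IDs follow the Cityscapes convention for ‘person’ (24) and ‘rider’ (25), but the target ID and the default are hypothetical, as the thesis does not specify the exact numeric encoding used.

```python
def remap_mask(mask, id_map, default=0):
    """Remap every label ID in a 2-D segmentation mask via a lookup table.

    Pixels whose ID is absent from id_map fall back to `default`
    (e.g. a catch-all 'other' class).
    """
    return [[id_map.get(px, default) for px in row] for row in mask]

# Hypothetical mapping: Cityscapes 'person' (24) and 'rider' (25) both
# collapse into a single 'human' class (1), as in Table 3.2.
id_map = {24: 1, 25: 1}
```

In practice this lookup would be vectorised (e.g. via an integer lookup table applied to the whole mask array), but the logic is the same.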
Different label mappings were required for the two datasets, because the MICC-SRI dataset was generated with CARLA v0.8.2, which had fewer labels than the more recent CARLA v0.9.12 used for the CCM dataset. The label mapping between Cityscapes and MICC-SRI was the same as in the experiments by Berlincioni et al. [246]; since it was not detailed in the original publication, I obtained the mapping through personal communication with Lorenzo Berlincioni, and it is provided in Table 3.2.

Table 3.2: Label mapping for experiments with the MICC-SRI dataset.

Cityscapes labels | MICC-SRI labels | Resulting labels for augmented dataset
unlabeled, ego vehicle, rectification border, out of ROI, static, dynamic, rail track, sky, license plate | none, other | other
road | road lines, roads | roads
ground, sidewalk, parking | sidewalk | sidewalk
building | buildings | buildings
wall, fence, guard rail, bridge, tunnel | fences, walls | fences, walls
pole, polegroup, traffic light, traffic sign | poles, traffic signs | poles, traffic signs
vegetation, terrain | vegetation | vegetation
person, rider | pedestrian | human
car, truck, bus, caravan, trailer, train, motorcycle, bicycle | vehicles | vehicles

Furthermore, I provide the mapping between the Cityscapes labels and the CARLA v0.9.12 labels, which I designed for the creation of the CCM dataset, in Table 3.3. Image resizing was necessary for experiments on the Cityscapes and MICC-SRI datasets, as the images in these datasets were of different sizes (1024×2048 pixels vs 600×800 pixels, respectively), whereas the input to a CNN typically needs to be uniform in size. One possible solution was to upscale the MICC-SRI images; however, since they have a different height-to-width ratio than the Cityscapes images (0.75:1 vs 0.5:1), upscaling would lead to significant distortion and likely reduce the accuracy of semantic segmentation.
Therefore, I took a different approach, splitting each Cityscapes image into 9 smaller images, each sized 600×800 pixels, thus matching the size of the images in the MICC-SRI dataset. No resizing was necessary for the experiments involving the CCM and Cityscapes datasets, as the synthetic images for the CCM dataset were generated with the same size as the real-world images in the Cityscapes dataset. Finally, to investigate how the amount of synthetic data used for augmentation affects the accuracy of semantic segmentation, I created three splits of the CCM dataset, each including all the Cityscapes real-world images designated for training, along with 100%, 50%, and 25% of the synthetic images that I generated with the CARLA simulator. The synthetic images for these splits were selected randomly; to avoid unnecessary repetition, I will refer to these splits as CCM-100, CCM-50, and CCM-25, respectively.

Table 3.3: Label mapping for experiments with the CCM dataset.

Cityscapes labels | CARLA (v0.9.12) labels | Resulting CCM dataset labels
unlabelled, ego vehicle, rectification border | unlabelled | unlabelled
building | building | building
fence | fence | fence
tunnel, pole group | other | other
pedestrian, rider | pedestrian, bike rider | pedestrian & rider
pole | pole | pole
road | road, roadline | road
sidewalk, parking | sidewalk | sidewalk & parking
vegetation | vegetation | vegetation
car, truck, bus, caravan, trailer, train, motorcycle, bicycle | vehicles | vehicles
wall | wall | wall
traffic sign | traffic sign | traffic sign
sky | sky | sky
ground | ground | ground
bridge | bridge | bridge
rail track | rail track | rail track
guardrail | guardrail | guardrail
traffic light | traffic light | traffic light
static | static | static
dynamic | dynamic | dynamic
terrain | water, terrain | water & terrain

3.5 Experiments: methodology and results

3.5.1 Methodology

I conducted semantic segmentation experiments using two CNN models from the DeepLabv3 library [118]: MobileNetV2 [105] and Xception-65 [117], both pretrained on the PASCAL VOC
2012 dataset [57]. As mentioned in Section 3.2, DeepLab is a well-known state-of-the-art library for semantic segmentation; I chose these two particular models because MobileNetV2 is a compact and fast CNN, while Xception-65 is a larger CNN that offers better segmentation accuracy at the cost of longer training times and greater GPU memory requirements. The models were trained on Dell EMC PowerEdge C4140 high-performance computing (HPC) servers of the Riga Technical University HPC Center4, equipped with Intel Xeon Gold 6130 CPUs and NVIDIA V100 GPUs with 16 GB VRAM. The models were trained using default settings: for MobileNetV2, the output stride was set to 8, and the training crop size was 769×769 pixels; for the Xception-65 models, the atrous rates were set to 6, 12, and 18, the output stride was 16, the decoder output stride was 4, and the training crop size was 769×769 pixels. The learning rate was set to 0.0001, and the training was optimised using the SGD optimiser with a momentum value of 0.9. I used only real-world images for validating and testing the models: in all experiments, the Cityscapes validation set was used for validation, and the Cityscapes test set for testing. The batch size for training was the maximum possible given the GPU memory at my disposal: 4 images for the MobileNetV2 models and 2 images for the Xception-65 models. For the experiments on the MICC-SRI and Cityscapes datasets, the MobileNetV2 models were trained for 1200 epochs on each dataset, and the Xception-65 models were trained for 300 epochs on each dataset. For the experiments on the CCM and Cityscapes datasets, the MobileNetV2 and Xception-65 models were trained for 200 epochs on the Cityscapes dataset and on the CCM-100, CCM-50, and CCM-25 dataset splits.
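The splitting of each 1024×2048 Cityscapes image into nine 600×800 crops, mentioned in Section 3.4, can be sketched as follows. This is a hedged illustration: the thesis does not state the exact crop positions, so an evenly spaced, overlapping 3×3 grid is assumed.

```python
def grid_crops(img_h, img_w, crop_h, crop_w, rows=3, cols=3):
    """Top-left corners of an evenly spaced rows x cols grid of crops.

    Adjacent crops overlap whenever rows * crop_h > img_h
    (or cols * crop_w > img_w), so the whole image is covered.
    """
    ys = [round(r * (img_h - crop_h) / (rows - 1)) for r in range(rows)]
    xs = [round(c * (img_w - crop_w) / (cols - 1)) for c in range(cols)]
    return [(y, x) for y in ys for x in xs]

# Nine 600x800 crops from a 1024x2048 Cityscapes image.
origins = grid_crops(1024, 2048, 600, 800)
```

Each `(y, x)` origin would then index a `[y:y+600, x:x+800]` window of both the image and its segmentation mask, so the crops stay aligned with their labels.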
In total, training the four models on the MICC-SRI and Cityscapes datasets took ≈ 1480 hours of computing, while training the eight models on the Cityscapes dataset and the CCM splits took ≈ 1415 hours of computing.

3.5.2 Results

I report the main results using the standard metrics for semantic segmentation: IoU and mIoU. Similarly to other authors, e.g. Cordts et al. [130], I report and include in the calculations only semantically meaningful classes, excluding classes such as Other or None. Since I had to modify the labelling scheme from the one originally used in the Cityscapes dataset, I was unable to directly compare the performance of the CNN models trained on the augmented datasets with state-of-the-art results on the original Cityscapes dataset reported in the literature, e.g., in Chen et al. [123]. Therefore, I compare the accuracy of the models trained on the augmented datasets with CNN models of the same architecture that I trained on the Cityscapes dataset using the accordingly modified labelling scheme.

Results on Cityscapes and MICC-SRI datasets

I summarise the main results of training MobileNetV2 and Xception-65 DNN models on the Cityscapes and MICC-SRI datasets in Tables 3.4 and 3.5. As can be seen, augmentation of the Cityscapes dataset with MICC-SRI images did not improve the accuracy of semantic segmentation; on the contrary, both MobileNetV2 and Xception-65 models performed slightly better when trained only on real-world images than on the dataset augmented with synthetic images, with an mIoU of 75.43% vs 75.11% for the MobileNetV2 model and an mIoU of 79.34% vs 78.81% for the Xception-65 model. The MobileNetV2 model trained only on the real-world images performed better than its counterpart trained on the augmented dataset across all segmentation classes, whereas in the case of the Xception-65 models, the only class on which the model trained on the augmented data achieved a better result than its

4 https://hpc.rtu.lv/; accessed 20 September 2024.
Table 3.4: Comparison of the accuracy (IoU) of semantic segmentation: MobileNetV2 trained on the Cityscapes dataset vs. MobileNetV2 trained on the Cityscapes dataset augmented with the MICC-SRI dataset.

Class | Cityscapes | Cityscapes augmented with MICC-SRI
Road | 92.66 | 92.62
Sidewalk | 67.02 | 66.61
Building | 86.48 | 86.18
Fences and Walls | 44.46 | 43.21
Poles and traffic signs | 57.07 | 56.72
Vegetation | 89.52 | 89.45
Pedestrians | 76.59 | 76.54
Vehicles | 89.65 | 89.55
Mean IoU | 75.43 | 75.11

Table 3.5: Comparison of the accuracy (IoU) of semantic segmentation: Xception-65 trained on the Cityscapes dataset vs. Xception-65 trained on the Cityscapes dataset augmented with the MICC-SRI dataset.

Class | Cityscapes | Cityscapes augmented with MICC-SRI
Road | 93.69 | 93.60
Sidewalk | 71.78 | 72.70
Building | 88.67 | 88.30
Fences and Walls | 52.20 | 49.16
Poles and traffic signs | 63.58 | 62.52
Vegetation | 90.75 | 90.58
Pedestrians | 81.75 | 81.39
Vehicles | 92.29 | 92.24
Mean IoU | 79.34 | 78.81

counterpart trained on the real-world images alone was Sidewalk. The likely explanation for the worse performance of the models trained on the augmented dataset is the low photorealism of images in the MICC-SRI dataset: while the quality of these synthetic images was sufficient for the semantic road inpainting task, it proved inadequate for the more challenging task of semantic segmentation. It is also worth noting that the Xception-65 models trained on the Cityscapes and MICC-SRI datasets demonstrate better performance than the MobileNetV2 models trained on the same datasets. This performance difference is likely due to the larger size (i.e., a greater number of parameters) and the resulting better learning capacity of the Xception-65 architecture.

Results on Cityscapes and CCM datasets

The results of training MobileNetV2 and Xception-65 models on the Cityscapes dataset and the three splits of the CCM dataset – CCM-100, CCM-50, and CCM-25 – are reported in Tables 3.6 and 3.7, respectively.
Table 3.6: Comparison of the accuracy (IoU) of semantic segmentation: MobileNetV2 trained on the Cityscapes, CCM-100, CCM-50, and CCM-25 datasets.

Class | Cityscapes | CCM-100 | CCM-50 | CCM-25
Building | 75.54 | 79.39 | 80.17 | 79.18
Fence | 0.02 | 21.49 | 24.47 | 17.82
Pedestrian & Rider | 69.23 | 67.94 | 68.92 | 69.38
Pole | 10.48 | 38.51 | 38.54 | 36.81
Road | 88.57 | 88.46 | 89.72 | 89.15
Sidewalk | 54.51 | 57.24 | 59.79 | 58.62
Vegetation | 83.90 | 86.14 | 86.41 | 85.54
Vehicles | 82.20 | 82.72 | 82.72 | 82.78
Wall | 0.00 | 19.90 | 23.81 | 15.95
Traffic Sign | 0.00 | 35.42 | 34.10 | 25.30
Sky | 82.80 | 85.32 | 85.83 | 85.85
Traffic Light | 0.00 | 21.32 | 15.38 | 0.13
Water & Terrain | 33.61 | 37.50 | 38.73 | 38.20
Mean IoU | 44.68 | 55.49 | 56.05 | 52.67

As can be seen, for both CNN architectures, augmentation with synthetic data improved semantic segmentation accuracy. For MobileNetV2, the model trained on CCM-100 achieved an mIoU of 55.49%, the model trained on CCM-50 achieved an mIoU of 56.05%, and the model trained on CCM-25 achieved an mIoU of 52.67%, whereas the model trained solely on Cityscapes images achieved an mIoU of 44.68%. The same trend was observed in the experiments with the Xception-65 models: the model trained on CCM-100 achieved an mIoU of 63.14%, the model trained on CCM-50 achieved an mIoU of 63.87%, and the model trained on CCM-25 achieved an mIoU of 64.46%, whereas the model trained on Cityscapes images only achieved an mIoU of 57.25%. Interestingly, the best-performing MobileNetV2 and Xception-65 models were not those trained on the CCM splits with the largest amounts of synthetic data: the best-performing MobileNetV2 model was trained on CCM-50, while the best-performing Xception-65 model was trained on CCM-25, the split with the smallest amount of synthetic data. This suggests that using larger amounts of synthetic data for augmentation does not necessarily lead to better performance than augmentation with smaller amounts of such data.
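The IoU and mIoU figures reported above can be derived from a per-class confusion matrix. The following minimal sketch is my own illustration (not the evaluation code used in the experiments); it computes per-class IoU and averages it over the semantically meaningful classes only, mirroring the exclusion of classes such as Other from the calculations.

```python
import numpy as np

# Per-class IoU = TP / (TP + FP + FN), computed from a confusion matrix
# whose rows are ground-truth classes and whose columns are predictions;
# mIoU averages IoU over the classes kept in the evaluation.
def iou_per_class(conf):
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1e-12)

def mean_iou(conf, ignore=()):
    ious = iou_per_class(conf)
    keep = [i for i in range(conf.shape[0]) if i not in ignore]
    return float(ious[keep].mean())

# Toy 3-class confusion matrix; class 2 plays the role of a class such
# as 'other' that would be excluded from the mIoU calculation.
conf = np.array([[50, 2, 0],
                 [3, 40, 1],
                 [0, 0, 4]])
```

Excluding a class simply removes it from the average; the per-class IoU values themselves are unaffected.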
Another notable finding is that models trained on different splits of the CCM dataset showed the best results (i.e., in comparison to other models) on different classes: thus, among the MobileNetV2 models, the one trained on CCM-100 showed better accuracy than the other models on the classes Traffic Sign and Traffic Light; the model trained on CCM-50 outperformed the others on the classes Building, Fence, Pole, Road, Sidewalk, Vegetation, Wall, and Water & Terrain; and the model trained on CCM-25 performed best on the classes Pedestrian & Rider, Vehicles, and Sky. These differences are exemplified in Figure 3.3: as shown, the MobileNetV2 model trained solely on the real-world Cityscapes images could not segment road signs and traffic lights; as for the Xception-65 model trained solely on the real-world Cityscapes images, while it was able to segment road signs, it could not distinguish between traffic lights and road signs, mistakenly classifying the former as the latter. In contrast, the MobileNetV2 and Xception-65 models trained on the CCM-50 and CCM-25 splits, respectively, i.e., the best-performing models for these architectures, were capable of segmenting road signs and traffic lights and distinguishing between them.

Table 3.7: Comparison of the accuracy (IoU) of semantic segmentation: Xception-65 trained on the Cityscapes, CCM-100, CCM-50, and CCM-25 datasets.

Class | Cityscapes | CCM-100 | CCM-50 | CCM-25
Building | 84.94 | 85.10 | 85.08 | 85.62
Fence | 37.20 | 40.19 | 40.19 | 43.44
Pedestrian & Rider | 78.08 | 76.42 | 76.94 | 77.92
Pole | 45.08 | 48.50 | 48.75 | 49.26
Road | 92.31 | 91.82 | 91.43 | 91.85
Sidewalk | 65.80 | 67.21 | 67.09 | 69.88
Vegetation | 87.71 | 87.00 | 87.59 | 87.85
Vehicles | 89.86 | 88.82 | 89.63 | 89.63
Wall | 23.29 | 28.88 | 27.69 | 31.63
Traffic Sign | 44.42 | 50.89 | 55.83 | 56.14
Sky | 85.13 | 88.79 | 89.70 | 90.34
Traffic Light | 0.00 | 43.64 | 42.19 | 44.13
Water & Terrain | 46.86 | 39.58 | 35.62 | 44.22
Mean IoU | 57.25 | 63.14 | 63.87 | 64.46
These observations are consistent with the results in Table 3.6 and Table 3.7 regarding the capacity of these models to segment objects from the Traffic Light and Traffic Sign classes.

Figure 3.3: Semantic segmentation of a sample image with different models: (a) the original image; (b) segmentation masks produced with MobileNetV2 models: trained on Cityscapes images only (left) and on the CCM-50 split (right); (c) segmentation masks produced with Xception-65 models: trained on Cityscapes images only (left) and on the CCM-25 split (right). Note the differences in the ability of the models to accurately segment traffic lights and road signs.

3.6 Concluding remarks

Semantic segmentation models are an essential part of the perception module of a self-driving car: although they tend to have higher latency than object detectors due to their higher computational complexity, they provide finer-grained information about the shapes of surrounding objects, allowing the decision-making system of the car to better sense the environment. One of the main challenges in developing such semantic segmentation models is the availability of data for training them: in addition to the need to acquire data, which is a notorious problem in deep learning in general, data for supervised training of semantic segmentation models need to be meticulously labelled, which consumes many human-hours. To tackle this problem, I trained the MobileNetV2 and Xception-65 semantic segmentation models from the DeepLabv3 library on a mix of real-world data and data generated with CARLA, an open-source simulator for autonomous driving. I conducted two series of experiments. In the first series, I trained the models on the Cityscapes dataset augmented with low-photorealism images from the MICC-SRI dataset, generated using an older version of CARLA.
In the second series, I trained them on the Cityscapes dataset augmented with more photorealistic synthetic images generated using a more recent version of CARLA. The main hypothesis of the study, namely, that augmenting real-world data with synthetic data would result in improved accuracy of semantic segmentation of street views, was not confirmed in the first series of experiments but was confirmed in the second series. As the main difference between the data used for augmentation in these two series of experiments was the degree of photorealism, I conclude that the crucial factor determining whether synthetic data degrades or improves the performance of the CNN-based segmentation models was how similar these images were to the real-world images in the target dataset, Cityscapes. This study also demonstrated that setting up a pipeline for generating synthetic data does not have to be costly or difficult, as CARLA allowed for generating a large amount of synthetic data quickly and without much meddling with the out-of-the-box installation of that open-source simulator. However, there was also a downside to the use of a ready-made solution for generating data: due to the disparities in the labelling systems of Cityscapes and CARLA, it was necessary to merge some class labels to create the CCM dataset, and because of that, it was not possible to submit the models trained on CCM for evaluation on the official Cityscapes semantic segmentation benchmarks. While that did not prevent me from comparing the accuracy of the models trained on non-augmented vs augmented datasets, as I resorted to setting aside a part of the available data for testing the models, it was an unfortunate obstacle for comparing these models with their counterparts developed by other researchers.
Although it was not possible to overcome that obstacle at the time when the study was conducted, the more recent releases of CARLA5 no longer pose this problem, as the labels in this simulator are now fully compatible with the labels of Cityscapes. As the versions of CARLA released since then also feature numerous improvements to assets, such as new town maps and vehicles, the use of CARLA-generated data for improving semantic segmentation on Cityscapes has become even more relevant and remains a promising direction for future work. Another research problem worth investigating further is determining the optimal amount of synthetic data for augmenting a real-world dataset to achieve the best semantic segmentation accuracy. One possible answer to this question is ubiquitous in experimental science, which deep learning indeed is: ‘it depends’. Indeed, since the optimal amount of synthetic data depends on many factors – for instance, whether its distribution covers well the classes underrepresented in the real-world dataset, or whether the synthetic data represent both foreground classes (i.e., objects) and background classes, or just background classes (cf. [249]) – it is possible that a trial-and-error approach will always be necessary. On the other hand, post hoc explanations of what ratio of synthetic data was the best are suboptimal both from the intellectual standpoint and because of practical considerations, as training multiple model instances on different ratios of the augmented data is time-consuming and computationally expensive. Therefore, discovering general trends – such as the already mentioned observation that more photorealistic data typically yield better results – appears to be a promising direction for future research.

5 Cf. e.g. the release notes for v0.9.14 - https://carla.org/2022/12/23/release-0.9.14/; accessed 10 April 2024.
While it seems unlikely that a robust, definitive solution to the problem of how much synthetic data to use for augmentation will emerge, I hypothesise that a promising approach is to modify the training of the model so as to take into account the domain gap between the real-world and synthetic data. In a way, that was already done in the study I detailed in this chapter, as I used only real-world data for validating the model during training in order to ensure that the model performs robustly on the data in the target domain. More complex approaches worth investigating further may involve additional modifications to the training procedure of semantic segmentation models.

Chapter 4

Object Detection for a Bin-Picking Task

In this chapter, I present a study on the use of CNN-based models for detecting graspable items in a pile of objects. Detected coordinates of such items can then be utilised by an industrial robot to pick these objects up, a task commonly referred to as bin-picking [250] in robotics. As visual object detectors are often an essential component of the perceptual system of industrial robots, the work reported in this chapter is relevant to the design and development of efficient robotic systems for manufacturing, sorting, and packing goods. Furthermore, as the data for training the object detectors were obtained in the same way as the data for training the semantic segmentation models described in the previous chapter, that is, by generating synthetic data, the research that I report here is relevant to the broader challenge of overcoming the limited availability of data for training CNN models. The present chapter is based on the following conference paper: [251] D. Duplevska, M. Ivanovs, J. Arents, and R.
Kadikis, “Sim2Real image translation to improve a synthetic dataset for a bin picking task,” in 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), pp. 1–7, IEEE, 2022. As a co-author of the above study, I was primarily responsible for training and validating the CNN-based object detectors and contributed substantially to the writing and editing of the manuscript. The experiments aimed at enhancing the quality of the synthetic data using Generative Adversarial Networks (GANs; [12]) were led by my collaborator Diana Duplevska.

4.1 Introduction

The adoption of robotic systems in manufacturing and other industrial operations has been growing steadily for decades now [252, 253], as they help to increase productivity, provide consistent output, ensure quality control, save costs, and improve safety in industrial settings. An essential component of such a system is a perception module (Figure 4.1), which supplies it with information about the environment, enabling its efficient functioning. The perception module typically handles image understanding tasks such as object localisation, image segmentation, and, particularly relevant for the topic of this chapter, object detection. Recent advancements in object detection tasks for robotics, as detailed in a survey by Bai et al. [255], have been driven by the adoption of well-established and widely recognised CNN-based visual object detectors, such as Faster R-CNN [133], SSD [143], and YOLO [47].

Figure 4.1: Overview of the architecture of a robotic arm. Adapted from [254].

However, for the successful design and deployment of a perception module based on a deep neural network, a large amount of training data is needed. Due to the specialised nature of tasks for industrial robots, acquiring such large training datasets poses a challenge, since images for these tasks are often not readily available in the public domain.
Furthermore, after the initial training of a CNN model, the need for additional training data can recur: in particular, this can happen when changes in the environment of the deployed system cause a domain shift, making the original training data inadequate for representing the task the robot needs to perform. A promising solution to the problem of the availability of data for designing perceptual systems in robotics is the use of simulations and synthetic data [253]. Leveraging synthetic data offers several advantages, including accelerating the design cycle, generating large amounts of data at low cost, and providing safe, fully controlled testing environments [256]. However, a number of unresolved challenges may affect robotic systems trained on synthetic data when they are deployed in real-life settings: in particular, their efficiency tends to decrease due to the differences between a simulation or synthetic data on the one hand and the real world on the other. The problem of the gap between synthetic and real-world data is particularly relevant for the visual domain. On the one hand, the photorealism of rendered images, videos, and computer games has been improving steadily, and the tools for creating virtual environments with integrated physics simulation are becoming more user-friendly and accessible: for example, since such tools as Blender1, Unity2, and Unreal Engine3 are available free of charge, researchers working on AI, computer vision, and robotics use them more frequently to generate data for training models or to train systems directly in virtual environments. On the other hand, a decrease in precision is often observed in models trained exclusively on synthetic data compared to those trained on real data [257].
In robotic systems, this concern is closely related to the issue that is commonly referred to as ‘the reality gap’ [258]: even though good performance can be achieved in simulations, trained models may perform unreliably when transferred to a real environment. Therefore, developing methods for a more robust sim-to-real translation is a pivotal research topic in robotics.

1 https://www.blender.org; accessed 19 September 2024.
2 https://unity.com; accessed 19 September 2024.
3 https://www.unrealengine.com; accessed 19 September 2024.

The research reported in the present chapter is concerned with improving the accuracy of detecting plastic bottles with high visibility (i.e., those on the top of the pile of bottles) so that a robotic arm would be able to grasp them more efficiently. This specific task was set forth in the EDI part of the project Intelligent Motion Control under Industry 4.E – IMOCO4.E; from the practical perspective, it was envisaged as a contribution to the design of a robotic arm operating on a real-world production line, whereas from the scientific perspective, this work continued the exploration of the use of synthetic data in robotics at EDI (see e.g. [259, 260, 261]). To detect graspable bottles, I used YOLOv5 [153], one of the most popular CNN-based object detectors; due to the already mentioned challenges with acquiring real-world data for training DNNs for robotics, I used synthetic data for that purpose. The main goal of the study reported in this chapter was to improve the accuracy of detecting plastic bottles with the YOLOv5 object detector by enhancing the degree of photorealism of the synthetic images. In particular, to make the synthetic images more photorealistic, image-to-image translation with GANs [12] was leveraged.
The main hypothesis of the study was that enhancing synthetic images of plastic bottles with GANs before training the YOLOv5 object detector on them would result in higher object detection accuracy compared to using unmodified synthetic images for training. The rest of the present chapter is organised as follows. In Section 4.2, I outline related work; in Section 4.3, I describe the initial datasets used in the present study; in Section 4.4, I detail the methodology and the results of the experiments; finally, in Section 4.5, I offer concluding remarks.

4.2 Related work: object detection and sim-to-real translation for robotics

Despite some initial scepticism about the prospects of adopting deep learning to help robots make sense of their environment (see [262] for an overview), DNN-based approaches to image understanding tasks rapidly became prevalent in robotics soon after their emergence in computer vision. Since industrial robotic systems operate in dynamic environments and need real-time information to function efficiently and safely, the object detection component of the perception module of such systems needs to be both accurate and fast. The second of these requirements, the need for fast inference with an object detector deployed as part of a robotic system, suggests considering single-stage rather than two-stage object detectors when designing an efficient robot. By design, single-stage object detectors prioritise speed by detecting objects in a single pass of the input image through the network, which makes them particularly suitable for applications involving real-time processing. Although single-stage object detectors may be less accurate than two-stage object detectors (see e.g. a comprehensive comparison by Carranza-García et al.
[263]), as the latter employ a separate region proposal step followed by object classification and bounding box refinement, the critical requirement for fast decision-making in industrial robotic systems may favour the adoption of single-stage detectors despite the potential trade-off in accuracy. One of the most popular single-stage object detectors is the YOLO family, starting with YOLOv1 in 2016 [47] and continuing up to the most recent (at the time of writing) YOLOv8 [135] and YOLO-NAS [136] (see Section 1.3.2 and a comprehensive survey by Terven et al. [156] for an overview). YOLO object detectors have been successfully applied to a wide variety of robotics use cases: to mention just a few examples, Tian et al. [264] used YOLOv2 to develop an object grasping system for a humanoid robot; Cao et al. [265] customised a lightweight Tiny YOLOv2 [139] to detect shuttlecocks for a badminton-playing robot; Zhaoxin et al. [266] designed a robotic system for picking tomatoes based on YOLOv5 and validated it through simulation. A number of studies (see, e.g., [267, 268, 269]) have also used various versions of YOLO as a component of the perception module of a robotic arm system. In the study reported in this chapter, I followed their approach and employed YOLOv5, the most recent version of YOLO at the time the study was conducted, as a state-of-the-art CNN model for object detection. Regardless of how advanced the architecture of a CNN model for image understanding is, its performance can degrade when trained only on synthetic data, as the reality gap is a topical issue for the applications of ML [270]. To bridge or at least narrow the gap in the visual domain, one can attempt to transfer the appearance of a real-world environment to artificial data. The main types of domain adaptation approaches are feature-level transfer [271] and pixel-level transfer [272, 273].
Feature-level transfer is concerned with learning domain-invariant features between source and target domains, whereas pixel-level transfer, which is typically based on image-conditioned GANs [12], focuses on image styling, i.e., images from a source domain are made to resemble images from a target domain. Overall, these methods can be used to address the simulation-to-reality domain shift in robotic manipulation tasks. Despite the advantages offered by the domain adaptation approaches outlined above, they also have a substantial drawback, namely, they may alter images so as to cause the loss of information essential for a given task. This issue is particularly problematic for tasks involving object detection and robotic manipulation, since the complete or partial erasure of objects or their features in the training dataset can degrade the performance of the system. For this reason, it is common to combine domain adaptation with additional techniques such as semantic maps of simulated images [274], which preserve the semantics relevant for the task. Furthermore, additional loss functions can be introduced. Thus, reinforcement learning (RL) task loss [275] enforces consistency of task policy Q-values between the original and transferred images to preserve information important to a given task: RL-CycleGAN is trained jointly with the RL model and requires task-specific real-world episodes. Another approach is to add an object detector with a perception consistency loss, which penalises the generator for discrepancies in object detection between translations. RetinaGAN [276], which is based on this approach, uses object detection to ensure consistency across different domains. In the work reported in this chapter, we4 used pixel-level domain adaptation based on the use of GANs.
To improve the quality of the synthetic data, we transferred the domain from the synthetic to the real one (sim-to-real translation); our objective was to preserve objects in the images and keep them as recognisable as possible.

4.3 Initial datasets

The starting point for our experiments consisted of two datasets: a real-world bottle image dataset and a synthetic bottle image dataset. I briefly describe both below.

4 In the parts of this chapter concerned with synthetic data generation, I use ‘we’ rather than ‘I’, as work on that topic was led by my colleague Diana Duplevska.

Figure 4.2: Sample images from the real-world (a) and original synthetic (b) dataset.

4.3.1 Real-world dataset

The dataset of real-world bottle images (see an example in Figure 4.2 (a)) was previously created by our colleagues at EDI. The real-world data were collected by placing bottles randomly in a plastic container and taking an image of the resulting visual scene. For each captured image, the positions of the bottles were altered by emptying the container and refilling it. Additionally, the camera exposure time and lighting intensity were systematically varied to acquire data with a high diversity of lighting conditions. In total, the dataset consists of 2 060 real-world images with corresponding manual labelling of the objects. Of these images, 1 760 images were used for training GAN models for sim-to-real transfer (see Section 4.4.1), and the remaining 300 images comprised the test dataset for object detectors (see Section 4.4.3).

4.3.2 Synthetic dataset

For the experiments in the study described in this chapter, a synthetic dataset, as described by Arents et al. [260], was generated; it consisted of 8 800 photorealistic, high-resolution images (see an example in Figure 4.2 (b)) of bottles in a box.
These images were used in the object detection experiments (see Section 4.4.3) both in their initial form, i.e., exactly as rendered, and after sim-to-real translation with GANs as described in Section 4.4.1. To generate the original synthetic images, the Blender physics simulation engine [277] was used. An initially empty box in the simulation was filled with bottles that were randomly dropped into it for each scene. This approach allowed for the realistic generation of random bottle configurations within the container. After filling the box with bottles, the intensity of four different light sources was varied in the simulation, and the scene was rendered from 16 different angles. Additionally, for increased realism, Blender shader nodes, realistic textures, reflections, and indirect light bounces were used. Finally, an annotation file was generated for each rendered scene; it included every object in the scene, its rotation angle, coordinates, and visibility percentage. Bottles were considered graspable if their visibility was above 60%.

4.4 Experiments: methodology and results

4.4.1 Sim-to-real transfer with CycleGAN

CycleGAN: overview of architecture and training procedure

We employed GANs [12] for image-to-image translation to enhance the photorealism of the synthetic dataset described in Section 4.3 and thus narrow the sim-to-real gap. For this purpose, we selected a specific GAN architecture, the Cycle-Consistent Adversarial Network (CycleGAN) [278], since it was designed to find a mapping from the domain X to the domain Y without requiring the training data to consist of matching image pairs. The absence of such a requirement was particularly important for our work, as otherwise, we would have had to create matching pairs of real-world and synthetic images, a task that would have been rather cumbersome.
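The unpaired-translation idea behind CycleGAN can be illustrated with a deliberately simplified toy example of my own: if a forward map G and an inverse map F are consistent, then F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y. The scalar "generators" below are assumed stand-ins (real CycleGAN generators are CNNs trained adversarially); only the shape of the L1 cycle-consistency loss follows Zhu et al. [278].

```python
import numpy as np

# Toy stand-ins for the two CycleGAN mapping functions (illustrative
# assumptions, not real generators): G maps domain X (synthetic) to Y
# (real), and F maps Y back to X.
def G(x):
    return 2.0 * x + 1.0

def F(y):
    return (y - 1.0) / 2.0

def cycle_consistency_loss(x_batch, y_batch):
    """L1 cycle loss: mean |F(G(x)) - x| + mean |G(F(y)) - y|."""
    forward = np.abs(F(G(x_batch)) - x_batch).mean()
    backward = np.abs(G(F(y_batch)) - y_batch).mean()
    return forward + backward

x = np.linspace(-1.0, 1.0, 5)   # toy 'synthetic' samples
y = np.linspace(0.0, 3.0, 5)    # toy 'real' samples
```

Because these toy maps are exact inverses, the loss is (numerically) zero; during real training, minimising this term discourages the generators from erasing or inventing content that cannot be mapped back.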
CycleGAN employs two mapping functions: G : X → Y to translate images in domain X to domain Y, and an inverse mapping F : Y → X for translating images in domain Y back to domain X. Additionally, there are two discriminator functions DX and DY, which are used to determine whether an image is in a respective domain; in this case, X corresponds to synthetic data, and Y corresponds to real-world data. Each mapping function, along with its associated discriminator function, has a generative adversarial loss. Furthermore, the inverse mapping introduces a cycle consistency loss to ensure F(G(X)) ≈ X and, conversely, G(F(Y)) ≈ Y. A schematic overview of the CycleGAN architecture is presented in Figure 4.3.

Figure 4.3: CycleGAN data generation algorithm. Reproduced from [278].

We began our experiments using the CycleGAN implementation in TensorFlow5 on the datasets described in Section 4.3. The primary difference between the original CycleGAN proposed by Zhu et al. [278] and the TensorFlow implementation is that the former employs a modified ResNet-based generator [76], whereas the latter is based on a modified U-Net [279] generator for simplicity. U-Net (see Figure 4.4) is a convolutional autoencoder with skip connections: the encoder downscales the image using convolutional layers, whereas the decoder upscales the latent space back to the original dimensions. The goal of adding a skip connection to each transposed convolutional layer in the decoder is to mitigate the vanishing gradient problem by concatenating the output of a layer to multiple layers rather than just one.

5 https://www.tensorflow.org/tutorials/generative/cyclegan; accessed 10 February 2024.

Figure 4.4: U-Net [279] network architecture. Reproduced from [251].

To conduct experiments with CycleGAN, several data preprocessing steps were necessary.
First, from the synthetic dataset containing 8 800 images, we selected only those with good lighting to ensure that the features of the objects were clearly visible, facilitating the training of the neural networks. As a result, the synthetic dataset used for training CycleGAN consisted of 1 760 images. Second, we cropped the backgrounds in the synthetic images to preserve important parts of the images before resizing (see the next step). Third, we resized the images, as CycleGAN requires images of the same size for training. The original real-world photos were 528×342 pixels, whereas the synthetic images were 1024×768 pixels; after resizing, all images were 256×256 pixels. This reduction in resolution was also necessary due to the limited GPU resources that were at our disposal for training CycleGAN. The training procedure was consistent across all experiments with CycleGAN: each model was trained for 20 epochs using the Adam optimiser with an initial learning rate of 0.0002, β1 = 0.5, and a cycle consistency loss weight of λ = 10. The weights were initialised with a Gaussian distribution with a mean of 0 and a standard deviation of 0.02. For each epoch, the dataset was shuffled, and the buffer size was set to 1000.

Datasets generated with CycleGAN

As a result of training CycleGAN to transfer the style of the real-world photos of bottles to the synthetic images, several datasets were generated. These datasets are described below; sample images from each dataset are shown in Table 4.1.

Baseline CycleGAN dataset. In the first experiment, CycleGAN was trained with the parameters listed above without any additional adjustments. As shown in Figure 4.5, the resulting neural network correctly translated and drew the shape of the box; however, the bottles that were in the shadow in the original synthetic image were partially erased after the transfer. Furthermore, the neural network drew non-existent bottles in the top part of the resulting image.
We considered the quality of such sim-to-real transfer insufficient for image understanding tasks, and for this reason, this dataset, which I refer to as the Baseline CycleGAN dataset, was not used for training object detectors. Figure 4.5: Sim-to-real transfer results: (a) a synthetic image used as input; (b) an output image generated using Baseline CycleGAN. Note the artefacts of sim-to-real translation – some non-existent bottles are drawn by CycleGAN in the top left corner of the generated image. Augmented CycleGAN dataset. To improve the quality of the generated images, we added several data preprocessing functions: in addition to image resizing, jittering, mirroring and data normalisation, we used colour data augmentation techniques, namely, random contrast, brightness, hue, and saturation. The aim of using these preprocessing functions was to reduce overfitting. We also added a central crop while maintaining the image size of 256 × 256 pixels, which allowed the neural network to focus on the bottles rather than the background and box. Augmented noise CycleGAN dataset. In an attempt to further improve the quality of the sim-to-real translation, we added Gaussian noise to the discriminator input to make it more difficult for the discriminator to evaluate images, thus allowing the generator to train longer. This method helps to avoid early overfitting of the discriminator; an overfitted discriminator would evaluate even high-quality images generated by the network as fake, leading to an imbalance between the generator and the discriminator. Resized convolution CycleGAN dataset. While we considered the Augmented CycleGAN dataset suitable for training object detectors, there were still some observable image defects whose elimination could potentially further improve the quality of the generated images.
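The Gaussian-noise trick used for the Augmented noise dataset can be sketched as follows (a numpy illustration under my own naming; in the actual pipeline the noise would be applied to the discriminator's input tensors during training):

```python
import numpy as np

def add_instance_noise(images, stddev, rng=None):
    # Add zero-mean Gaussian noise to the discriminator's input; keeping
    # (or slowly decaying) a small stddev makes real and fake inputs harder
    # to tell apart, delaying discriminator overfitting so that the
    # generator can keep training longer.
    if stddev == 0.0:
        return images
    rng = rng if rng is not None else np.random.default_rng(0)
    return images + rng.normal(0.0, stddev, size=images.shape)
```

With `stddev = 0` the function is a no-op, so the technique can be switched off without changing the rest of the training loop.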
The most noticeable problem was a small pixel grid in the image, known as checkerboard artefacts [280], which occur because of the transposed convolution [281] operations in the decoder. Therefore, to solve this problem, it was necessary to replace the transposed convolution operation with some other operation. One possible solution was to resize the images and then add a regular convolutional layer; we chose a different approach, namely, we replaced the transposed convolution with the TensorFlow function tf.keras.layers.UpSampling2D to upscale the images. By using such resizing with convolution, we removed the checkerboard artefacts, yet the images became blurred, which is a substantial problem for object detection, because when the edges of objects blur, their boundaries become imprecise. Resized transpose CycleGAN dataset. To address the issue of the blurred edges that we identified in the images in the Resized convolution CycleGAN dataset, we applied resized convolution on all layers except the last one, where we retained the transposed convolution. This method reduced the blurriness, and the checkerboard grid became hardly noticeable. 4.4.2 Evaluation of datasets generated with CycleGAN with FID score We used the Fréchet Inception Distance (FID) score [282] to evaluate the quality of the images in the datasets generated with the different versions of CycleGAN. The FID score is a performance metric that calculates the distance between the feature vectors of original images (in this case, either the real-world images or the synthetic images as originally rendered with Blender) and the feature vectors of enhanced images, i.e., images generated by a GAN.
It is given by the following formula:

FID = ∥µo − µe∥² + Tr(Σo + Σe − 2√(ΣoΣe)),   (4.1)

where µo is the mean vector of the feature distribution for the original images, µe is the mean vector of the feature distribution for the enhanced images, ∥ · ∥ is the Euclidean norm of the difference between the mean vectors, Σo is the covariance matrix of the feature distribution for the original images, Σe is the covariance matrix of the feature distribution for the enhanced images, and Tr is the trace of a matrix. Importantly, a lower FID score indicates a smaller difference between the original and the enhanced images, implying better sim-to-real transfer results. To obtain the FID scores, we used the TensorFlow-GAN (TF-GAN) library⁶. In total, 5 000 images were used; this amount of data was obtained using several augmentation techniques: random horizontal and vertical flips, rotations, and hue changes. The FID scores were calculated by comparing CycleGAN-generated images with the real-world images (FID A score) and the original synthetic images (FID B score); these comparisons allowed us to investigate the similarity between the enhanced images and their original synthetic versions as well as how similar they were to the real-world images, i.e., the target domain. All the FID scores and sample images from each dataset are shown in Table 4.1. As the results show, the worst FID scores for both FID A and FID B were obtained for the Baseline CycleGAN dataset. When comparing the enhanced images with the real-world images (FID A), the best FID score was obtained for the Resized transpose CycleGAN dataset; however, when comparing the enhanced images with the original synthetic images (FID B), the best FID score was obtained for the Resized convolution CycleGAN dataset. Table 4.1: Evaluation of sim-to-real transfer with CycleGAN using FID score.
Dataset:   Baseline   Augmented   Augmented noise   Resized convolution   Resized transpose
FID A:     230.28     171.41      137.27            141.03                112.26
FID B:     264.81     169.73      152.29            122.86                127.25
(Sample images from each dataset are omitted here.)

FID A evaluates the distance between the images enhanced with CycleGAN and the real-world images; FID B evaluates the distance between the images enhanced with CycleGAN and the original (i.e., as rendered with Blender) synthetic images.

⁶https://github.com/tensorflow/gan; accessed 18 May 2024.

For further investigation, we chose two different types of CycleGAN modified from the basic CycleGAN – Augmented noise and Resized transpose – and compared the quality of large images produced with them. These two types of CycleGAN use different methods for image resizing in the CNN part, which can affect image enhancement in different ways. On small input images, such defects are imperceptible and do not affect the FID score, but on large images, defects and artefacts may appear that affect the image semantics. We applied Augmented noise and Resized transpose CycleGAN to the original synthetic images with a resolution of 1024 × 768 pixels and bright lighting. As a result, we created two more datasets: the Augmented noise CycleGAN 1024 × 768 dataset and the Resized transpose CycleGAN 1024 × 768 dataset. We could not compare these datasets with the real-world images using the FID A score due to their differing sizes and semantics; therefore, we only compared them with the original synthetic images by calculating the FID B score to investigate how the image quality changed after enhancement with CycleGAN. The results of the comparison and sample images are provided in Table 4.2. Table 4.2: Evaluation of sim-to-real transfer with CycleGAN using the FID score for images with a resolution of 1024 × 768 pixels and a constant brightness level.
Dataset:   Augmented noise 1024 × 768   Resized transpose 1024 × 768
FID B:     58.46                        69.24
(Sample images are omitted here.)

As the lower FID scores in Table 4.2 compared to those in Table 4.1 indicate, the quality of the images with the resolution of 1024 × 768 pixels after sim-to-real transfer was better than that of the images with the resolution of 256 × 256 pixels. Furthermore, when comparing the Augmented noise CycleGAN 1024 × 768 dataset with the Resized transpose CycleGAN 1024 × 768 dataset, it can be seen that the pixel grid in the former was less noticeable, resulting in a smaller impact on the FID score than the blurred objects without the pixel grid in the latter dataset. As a result, the Augmented noise CycleGAN 1024 × 768 dataset had a better FID score than the Resized transpose CycleGAN 1024 × 768 dataset. While the increase in the resolution of enhanced images resulted in an improved FID score, it should be noted that these results were obtained by conducting experiments only on large images with excellent lighting, yet the subsequent object detection experiments were to be conducted on images with different brightness parameters. To address this issue, two additional datasets were created by enhancing original synthetic images with various lighting conditions, ranging from very bright to dark, and different light source positions, using Augmented noise CycleGAN and Resized transpose CycleGAN. When applying Augmented noise CycleGAN to these images, the semantics of the images did not change, though some artefacts appeared in the background. However, when Resized transpose CycleGAN was used on images with medium brightness, artefacts appeared in the form of pixels around the objects (bottles and the box), which could interfere with the semantics of the images, such as the shapes of the objects. The latter dataset had a relatively high FID score, suggesting that the image quality was worse in this case and would likely result in decreased precision for the object detection task.
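Eq. (4.1) can be implemented directly. A numpy-only sketch (my own code, not the TF-GAN implementation used in the experiments); note that Tr((ΣoΣe)^½) is computed via the mathematically equivalent symmetric form Tr((Σe^½ Σo Σe^½)^½), so that a plain eigendecomposition of a symmetric matrix suffices:

```python
import numpy as np

def _sqrtm_psd(mat):
    # Matrix square root of a symmetric positive semi-definite matrix
    # via eigendecomposition; eigenvalues are clipped to guard against
    # tiny negative values caused by floating-point error.
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(mu_o, sigma_o, mu_e, sigma_e):
    # Frechet Inception Distance between two Gaussian feature
    # distributions, Eq. (4.1); lower is better.
    diff = mu_o - mu_e
    s_e_half = _sqrtm_psd(sigma_e)
    covmean = _sqrtm_psd(s_e_half @ sigma_o @ s_e_half)
    return float(diff @ diff + np.trace(sigma_o + sigma_e - 2.0 * covmean))
```

In practice, µ and Σ are the mean and covariance of Inception feature vectors extracted from the two image sets being compared.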
Therefore, despite the consideration that it would be useful to account for varying lighting conditions, the datasets with the constant rather than varying brightness level were used for the object detection experiments reported in Section 4.4.3. Table 4.3: Evaluation of sim-to-real transfer with CycleGAN using the FID score for images with a resolution of 1024 × 768 pixels and varying brightness level.

Dataset:   Augmented noise 1024 × 768, varying brightness   Resized transpose 1024 × 768, varying brightness
FID B:     80.11                                            124.83
(Sample bright and dark images are omitted here.)

4.4.3 Object Detection Experiments Methodology I conducted object detection experiments using the YOLOv5 object detector [153] implemented in the Ultralytics library⁷. The experiments were conducted on each dataset with the YOLOv5 Small, Medium, and Extra Large models pretrained on the MS COCO dataset [141]; as suggested by the model names, they differ in the number of parameters: 7.2 million, 21.2 million, and 86.7 million, respectively. The models were trained on three datasets: the original dataset of synthetic images, the Augmented noise CycleGAN 1024 × 768 dataset, and the Resized transpose CycleGAN 1024 × 768 dataset. During the training, 90 percent of the images in each dataset were used for training, and 10 percent were used for validation. The models were trained with the following parameters: an input image size of 640 × 640 pixels, a batch size of 16, a learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005. Each model was trained for 300 epochs with early stopping after 100 epochs without improvement in the validation loss. ⁷https://github.com/ultralytics; accessed 11 November 2024. After training, the checkpoint with the best performance on the validation data was retrieved and tested on a set of 300 real-world test images.
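The 90/10 split described above can be sketched as follows (a hypothetical helper of my own; the actual experiments relied on the split handling of the training scripts):

```python
import numpy as np

def train_val_split(paths, val_fraction=0.1, seed=0):
    # Shuffle the image paths once with a fixed seed, then reserve
    # val_fraction of them for validation and use the rest for training.
    paths = np.array(paths)
    idx = np.random.default_rng(seed).permutation(len(paths))
    n_val = int(round(len(paths) * val_fraction))
    return paths[idx[n_val:]].tolist(), paths[idx[:n_val]].tolist()
```

Fixing the seed keeps the split reproducible across the three model sizes, so that every model sees the same training and validation images.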
Results I report the results of the object detection experiments in Table 4.4, using standard metrics for the task of object detection, namely, precision, recall, and mean average precision (mAP) for bounding boxes. mAP is calculated at an intersection over union (IoU) threshold of 0.5 as well as averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. As shown in Table 4.4, the models trained on the Augmented noise CycleGAN 1024 × 768 dataset consistently outperformed both the models trained on the original synthetic dataset and the models trained on the Resized transpose CycleGAN 1024 × 768 dataset in terms of precision, recall, mAP at an IoU threshold of 0.5, and mAP averaged over IoU thresholds ∈ [0.5 : 0.05 : 0.95]. In contrast, the performance of the models trained on the other enhanced dataset, Resized transpose CycleGAN 1024 × 768, was consistently worse than that of the models trained on the original synthetic data, with the sole exception being the better recall of the Extra Large model. Another noteworthy observation is that when comparing the performance of different model sizes on the same datasets, larger models did not consistently outperform their smaller counterparts: for instance, both the mAP at an IoU threshold of 0.5 and the mAP averaged over IoU thresholds ∈ [0.5 : 0.05 : 0.95] were better for YOLOv5 Small trained on the Augmented noise CycleGAN 1024 × 768 dataset than for YOLOv5 Extra Large trained on the same dataset. This may indicate that the larger object detectors employed in the experiments were overfitting on the training dataset, which was comparatively small in size. Table 4.4: Results of the object detection experiments with YOLOv5 on the original synthetic data and synthetic data enhanced with CycleGAN.
Model               Dataset                         Precision   Recall   mAP (threshold 0.5)   mAP (avg for IoU ∈ [0.5 : 0.05 : 0.95])
YOLOv5 Small        Original synthetic              68.9        89.4     74.2                  44.7
                    Resized transpose 1024 × 768    61.5        83.8     63.2                  24.5
                    Augmented noise 1024 × 768      73.1        90.5     77.7                  48.0
YOLOv5 Medium       Original synthetic              69.3        80.2     72.0                  42.0
                    Resized transpose 1024 × 768    58.1        78.2     58.9                  20.4
                    Augmented noise 1024 × 768      71.2        91.0     75.3                  45.5
YOLOv5 Extra Large  Original synthetic              71.8        78.5     75.0                  41.3
                    Resized transpose 1024 × 768    68.3        86.7     71.5                  33.9
                    Augmented noise 1024 × 768      72.1        87.2     76.1                  46.1

4.5 Concluding remarks The goal of the study reported in this chapter was to improve the accuracy of detecting plastic bottles for a bin-picking task by enhancing the photorealism of synthetic images using CycleGAN. In our initial experiment, we employed the original implementation of CycleGAN in TensorFlow for sim-to-real image translation and generated the Baseline CycleGAN dataset; in subsequent experiments, we used improved versions of the CycleGAN architecture and generated several other datasets: the Augmented CycleGAN, Augmented noise CycleGAN, Resized convolution CycleGAN, and Resized transpose CycleGAN datasets. Evaluation of these datasets by means of the FID score demonstrated that the best sim-to-real transfer results were achieved with the Resized transpose CycleGAN and Augmented noise CycleGAN models. To further enhance the images, we used the Resized transpose CycleGAN and Augmented noise CycleGAN models to generate images with a higher resolution, namely, 1024 × 768 pixels, compared to 256 × 256 pixels in the prior experiments. As a result, improved FID scores were achieved for these new datasets, with the Augmented noise CycleGAN dataset obtaining a better FID score than the Resized transpose CycleGAN dataset.
However, our attempts to further improve the FID score by applying sim-to-real transfer not only to bright images, as in the case of the Resized transpose CycleGAN 1024 × 768 and Augmented noise CycleGAN 1024 × 768 datasets, but also to images with varying lighting conditions and light source positions, were unsuccessful, yielding worse FID scores due to new artefacts. Therefore, the best results for sim-to-real transfer in our study were achieved with the Resized transpose CycleGAN 1024 × 768 and Augmented noise CycleGAN 1024 × 768 datasets, which were then used in the object detection experiments. The goal of the object detection experiments was to compare the object detection accuracy of models trained on the original Blender-generated synthetic data with that of models trained on enhanced synthetic images. I trained YOLOv5 object detectors of three different sizes on the original synthetic dataset, the Resized transpose CycleGAN 1024 × 768 dataset, and the Augmented noise CycleGAN 1024 × 768 dataset; after training, the models were tested on real-world images. While the object detectors trained on the Resized transpose CycleGAN 1024 × 768 dataset performed worse than those trained on the original synthetic data, the models trained on the Augmented noise CycleGAN 1024 × 768 dataset outperformed the models trained on the original synthetic data across all metrics of interest: precision, recall, and mAP. These results demonstrated that CycleGAN can be successfully used for sim-to-real translation of synthetic datasets for bin-picking tasks. The experiments reported in this chapter are the first step in the overall development of a bin-picking pipeline. While the object detectors trained in these experiments detect objects that are the most promising for a successful grasp in 2D images, the proposed approach can also be utilised for further steps such as instance segmentation and grasp pose estimation in 3D.
Furthermore, the synthetic data generated with Blender contain sufficient 3D information to be used by different approaches, such as directly performing 6D object pose estimation (see e.g. [283]). However, a detailed plan of how the approach developed in this study can contribute to these steps is beyond the scope of this chapter and is envisaged as future work. Chapter 5 Image Classification for Monitoring the Growth of Organs-on-a-Chip In this chapter, I describe the development of CNN-based image classifiers for monitoring the growth of organs-on-a-chip (OOC; see [284] for a comprehensive survey), a promising emerging technology in biomedical research. Given that both the initial and final datasets of OOC microscopy images for training and evaluating CNN models were relatively small, as is often the case with studies involving biomedical image understanding, I augmented them with synthetic data, attempting to improve the accuracy of the classifiers. Therefore, in addition to exploring the applications of deep learning methods in biomedicine, this chapter also further examines the approach of training CNNs on a mix of real-world and synthetic data, which I began to explore in Chapters 2, 3, and 4. This chapter builds upon the following scientific papers: [285] M. Ivanovs, L. Leja, K. Zviedris, R. Rimsa, K. Narbute, V. Movcana, F. Rumnieks, A. Strods, K. Gillois, G. Mozolevskis, A. Abols, and R. Kadikis, “Synthetic image generation with a fine-tuned latent diffusion model for organ on chip cell image classification,” in 2023 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pp. 148-153, IEEE, 2023. [286] V. Movčana, A. Strods, K. Narbute, F. Rūmnieks, R. Rimša, G. Mozoļevskis, M. Ivanovs, R. Kadiķis, K. Zviedris, L. Leja, A. Zujeva, T. Laimiņa, and Arturs Abols, “Organ-On-A-Chip (OOC) Image Dataset for Machine Learning and Tissue Model Evaluation”, Data, vol. 9, issue 2, 2024.
As the first author of [285], I was responsible for planning and conducting experiments with both real-world and synthetic data. I also led the analysis of the experimental results and had a principal role in drafting, revising, and finalising the manuscript in collaboration with the other co-authors. As the lead author from the field of Computer Science in [286], I ensured that the image dataset published as an integral part of that publication was suitable for ML purposes; I was also in charge of validating CNN models on that dataset. 5.1 Introduction Recently, CNNs have been successfully applied to a broad range of biomedical image understanding tasks, such as the segmentation of magnetic resonance images (MRI) for diagnosing Alzheimer’s disease [287], the segmentation of endoscopic and nuclei images [288], and the classification of skin lesions [289] and chest X-ray images [290]. Remarkably, they have even outperformed human experts in some of these tasks [291]. In the work reported in the present chapter, I used CNNs to classify biomedical microscopy images to monitor the growth of OOC. To underscore the practical importance of this task, I provide an insight into the research on OOC in the following. OOC technology combines tissue engineering and microfluidics to imitate key aspects of human physiology, with the aim of recreating the environment of particular human organs in vitro¹. The primary assumption is that certain functions of a human organ can be replicated by growing the respective cells (e.g., gut epithelial cells to imitate intestines, or lung epithelial and endothelial cells to imitate lungs) in horizontal microfluidic channels separated by a porous membrane. In an OOC setup, culture media flows over both sides of the membrane, ensuring that the cells are supplied with nutrients and metabolic waste is removed.
Furthermore, the flow exerts shear stress on the cells, similar to that exerted on cells in a living system, thereby creating a more physiologically plausible environment. A successfully operating OOC setup needs to ensure that the tissue models are representative of their originals both morphologically and functionally; to achieve this, daily monitoring of the samples is essential. Such monitoring can be improved and automated by using an OOC cultivation system controlled by a machine learning system rather than supervised by a human, which would reduce the operating costs of the system and increase its output. One of the key requirements for the design of such a system, as defined in the project AI-Improved Organ on Chip Cultivation for Personalised Medicine – AImOOC, in which the research reported in this chapter was carried out, was the ability to classify the condition of the cell sample on a chip (accessible via microscopy images acquired with the camera attached to an optical microscope) as ‘good’, ‘acceptable’, or ‘bad’ (Figure 5.1). Depending on the classification outcome, the intensity of the flow of the solution to the chip should be maintained or increased, or the experiment should be stopped altogether if it is deemed unproductive to continue. Figure 5.1: Schematic overview of the automated system for monitoring the growth of OOC: (a) a chip for growing OOC; (b) an OOC setup; (c) an image of OOC tissue acquired with a light microscope; (d) an image classifier; (e) an output of the classifier. Given that state-of-the-art results on image classification tasks have been achieved with CNNs, using a CNN-based classifier as a part of the automated system in question is a promising solution.
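One plausible way to wire the classifier's output into the control loop described above is a simple label-to-action mapping (the labels come from the AImOOC requirement, while the exact mapping shown is my assumption, inferred from the description):

```python
# Hypothetical mapping from the classifier's label to the action taken by
# the OOC cultivation system; the labels ('good', 'acceptable', 'bad') are
# from the AImOOC requirement, the specific actions assigned are assumed.
ACTIONS = {
    "good": "maintain flow intensity",
    "acceptable": "increase flow intensity",
    "bad": "stop the experiment",
}

def next_action(label):
    # Decide the system's next step from the classifier's output label.
    return ACTIONS[label]
```

Keeping the decision logic separate from the classifier makes it easy to revise the control policy (e.g., adding confidence thresholds) without retraining the model.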
However, when implementing such a system, the issue of data availability for training a CNN model inevitably arises, as the number of images for training and evaluating CNN models in the AImOOC project was expected to be comparatively small, ranging from several hundred to several thousand, because of the current throughput of non-automated OOC systems. Furthermore, to the best of my knowledge, at the time when the AImOOC project was carried out, there were no publicly available datasets that would be suitable for that task, as the images obtained with the OOC setup designed by the EDI partners in the AImOOC project had a rather specific appearance (see a sample image in Figure 5.1). To address the issue of limited data availability, I adopted the promising approach of using a large generative model for image synthesis. Such highly popular large text-to-image models as Midjourney [157], DALL·E [158] and Stable Diffusion [13] have demonstrated impressive capabilities in generating artificial images for various purposes – from book illustrations to visuals for advertising campaigns to synthetic X-ray images [292] – and are available in implementations that are quite easy to use. However, one should take into account that these models have not originally been trained on the data that would make them capable of generating such domain-specific data as OOC biomedical images; furthermore, retraining them from the ground up would require a lot of computational resources and would therefore be prohibitively expensive and time-consuming. Nevertheless, it is possible to produce synthetic counterparts of the real-world OOC images by such means as image-to-image translation, as well as to fine-tune large generative models on a small amount of data on consumer-grade computer hardware by employing recently proposed methods such as low-rank adaptation (LoRA) [293]. ¹I.e., not in a living organism, but rather in a controlled artificial environment.
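The idea behind LoRA can be illustrated on a single linear layer (a numpy sketch of the method from [293], not the actual fine-tuning code used later in this chapter): the pretrained weight W stays frozen, and only a low-rank update (α/r)·BA is trained, with B initialised to zero so that fine-tuning starts from the unmodified model.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (LoRA)."""

    def __init__(self, weight, r=4, alpha=4.0, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        d_out, d_in = weight.shape
        self.weight = weight                        # frozen pretrained W
        self.A = rng.normal(0.0, 0.01, (r, d_in))   # trainable, small init
        self.B = np.zeros((d_out, r))               # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Effective weight: W + (alpha / r) * B @ A; only A and B
        # (2 * r * d parameters instead of d_out * d_in) are updated.
        return x @ (self.weight + self.scale * self.B @ self.A).T
```

Because only A and B are updated, the number of trainable parameters is tiny compared with the full weight matrix, which is what makes fine-tuning feasible on consumer-grade hardware.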
The goal of the research reported in the present chapter was to develop a classifier for OOC microscopy images. To achieve this goal, two studies were conducted. In the first study, I trained the EfficientNet-B7 [107] CNN on both the initial real-world dataset of microscopy OOC images and the augmented dataset that included synthetic images generated with the Stable Diffusion model fine-tuned with LoRA. In the second study, the team of ML researchers that I led trained the EfficientNet-B7 and MobileNetV3Large [106] CNNs on a larger real-world dataset of microscopy OOC images [286] and the augmented dataset, in which real-world images were supplemented with their synthetic counterparts generated with various generative AI methods available for the Stable Diffusion model, such as image-to-image translation, inpainting with masks, interpolation, and fine-tuning with LoRA. As in the research reported in Chapter 3, the experiments with augmented datasets in both studies reported in this chapter involved adding different proportions of synthetic data to the real-world dataset to investigate how the proportion of synthetic data would affect the performance of the classifier. For both studies that I report in the following, I proposed the same two hypotheses, namely:

• Hypothesis 1: A CNN-based classifier achieves better accuracy on the real-world microscopy OOC image dataset than a putative ‘naive’ classifier.

• Hypothesis 2: The classification accuracy on the real-world microscopy OOC image dataset improves when a CNN-based classifier is trained on the dataset augmented with synthetic data generated with the Stable Diffusion model rather than solely on the real-world image dataset.

The structure of the rest of the chapter is as follows.
In Section 5.2, I discuss related work; in Section 5.3, I describe the first study, which was conducted on the initial OOC image dataset; Section 5.4.1 is concerned with the second study, which was conducted on the final OOC image dataset; finally, in Section 5.5, I offer some concluding remarks. 5.2 Related work 5.2.1 Synthetic data for training CNN models for biomedical image understanding tasks The use of synthetic data for training CNNs for various image understanding tasks in the biomedical domain has recently become increasingly popular, as it can help address the scarcity of real-world data, both in terms of overall dataset size and the underrepresentation of certain classes. Beyond the common challenges of using synthetic data, such as bridging the gap between the synthetic and real-world domains, there are additional challenges specific to biomedical image understanding tasks. In particular, while many other domains have readily available image synthesis tools – e.g., it is possible to acquire artificial images of street views in video games [243, 294], or generate them with a car driving simulator, as I did in the research reported in Chapter 3 – such solutions are often unavailable for more niche biomedical purposes. Furthermore, biomedical images tend to be high-resolution, which may require more powerful hardware and longer computation times to generate synthetic data. Yet another challenge is evaluating the generated synthetic images: while the quality of general-purpose synthetic images – such as faces, cars, street views, human bodies, and other common scenes and objects – can at least preliminarily be evaluated visually without additional expertise, evaluating artificial biomedical images may require expert knowledge, which can lengthen the iteration cycle for improving synthetic data. Some of the popular methods for generating synthetic data for biomedical purposes are Variational Autoencoders (VAEs; [295]) and Generative Adversarial Networks (GANs).
VAE is a generative model that combines the autoencoder architecture with Bayesian approaches to encode the inputs into a latent space and then generate new data by decoding from it. Since VAEs can efficiently model large datasets, they have been successfully used to generate biomedical data such as brain MRI [296] and endoscopic images [297]. However, VAEs also tend to produce blurry output due to learning non-informative latent codes [298] and unrealistic distributions of prior vs posterior data [299]. GANs, which were used in the research reported in Chapter 4, have also been successfully applied to a number of biomedical image understanding tasks, such as generating blood cell images [300] and images of retinal blood vessels [301]. However, as we could see in Chapter 4, adapting a GAN to a specific task can require a number of iterative changes by trial and error, and all in all, training GANs has a reputation in the ML community for being challenging due to the frequently occurring issues such as mode collapse, instability of the model, and non-convergence [302]. Therefore, while these approaches, especially the use of GANs, remain a lively area of research with substantial potential for applications in biomedicine, it may be beneficial to explore other robust and simpler means of image synthesis. 5.2.2 Large generative models for image synthesis A promising alternative to the smaller generative architectures mentioned in the previous section is the use of large generative models for image synthesis. These DNN-based models, trained on very large datasets containing a broad range of imagery, can generate images with complex semantics and a high degree of photorealism in response to simple text prompts given in natural language. Some of the most popular large text-to-image models are Midjourney [157], DALL·E [158], and Stable Diffusion [13]. 
In addition to the high quality of their output, another advantage of these models is their ease of use, as they are accessible via user-friendly interfaces, either through online platforms or, in the case of Stable Diffusion, locally as well, and do not require any advanced skills to run them in a standard configuration. Several studies have already leveraged large text-to-image models to generate biomedical images: Akrout et al. [303] generated images of skin diseases with Stable Diffusion; Ali et al. [292] created realistic X-ray and computed tomography synthetic images of lungs with Stable Diffusion and DALL·E 2; Chambon et al. [304] generated X-ray images of lungs with a fine-tuned Stable Diffusion model. The main challenge in applying large text-to-image models to generating biomedical images is adapting these models to the biomedical domain. This adaptation is necessary due to the specific semantics of the target images, which differ from the general nature of the images these models were originally designed to produce. Therefore, it may be more practical to use open-source large text-to-image models instead of proprietary ones, as the former offer more options for modifying (e.g., by means of fine-tuning) the foundational model. Since the most popular open-source generative AI model is currently Stable Diffusion [13], a latent text-to-image diffusion model, I chose to use it in the studies reported in this chapter and describe it in the following. 5.2.3 Stable Diffusion model for image generation Diffusion models, first introduced in the context of generative modelling by Sohl-Dickstein et al. [305], are deep latent variable models that generate new data, such as images, by means of a two-stage process (see Figure 5.2): in the forward diffusion stage, Gaussian noise is added to the training data, slowly corrupting it, whereas in the reverse diffusion stage, the original input data is recovered by gradually reversing the diffusion process [306].
Figure 5.2: The two stages of diffusion model training: the forward and the reverse diffusion stage. Reproduced from [292].

More formally, the forward diffusion stage, starting from the original data x₀ ∼ q(x₀) and proceeding for T iterations, is defined as

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I) [307],   (5.1)

q(x_{1:T} | x₀) = ∏_{t=1}^{T} q(x_t | x_{t−1}) [307],   (5.2)

where the hyperparameter β_t (for t ∈ {1, . . . , T}), the variance scheduler, is a small positive constant that controls the amount of noise added at step t and is set such that x_T approximates a standard Gaussian distribution [307]. The reverse diffusion stage is defined as

p_θ(x_{t−1} | x_t) = N(x_{t−1}; µ_θ(x_t, t), Σ_θ(x_t, t)) [307],   (5.3)

p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^{T} p_θ(x_{t−1} | x_t) [307],   (5.4)

where θ are the parameters learned by the diffusion model during the training. To make it possible to impose conditions (for instance, a text prompt for text-to-image translation) on the data generation by the diffusion model, a set of conditions c can be incorporated into the reverse diffusion stage, which therefore becomes

p_θ(x_{t−1} | x_t, c) = N(x_{t−1}; µ_θ(x_t, t, c), Σ_θ(x_t, t, c)) [307].   (5.5)

The training of the diffusion model involves minimising a variational lower bound on the negative log-likelihood:

L_vlb = − log p_θ(x₀ | x₁) + KL(p(x_T | x₀) ∥ π(x_T)) + Σ_{t>1} KL(p(x_{t−1} | x_t, x₀) ∥ p_θ(x_{t−1} | x_t)) [306],   (5.6)

where KL stands for the Kullback–Leibler divergence between the true distribution p and the model distribution p_θ. As Ho, Jain, and Abbeel [308] demonstrated, the above training objective can be effectively replaced by a simpler one, namely,

L_simple = E_{t∼[1,T]} E_{x₀∼p(x₀)} E_{z_t∼N(0,I)} ∥z_t − z_θ(x_t, t)∥² [306],   (5.7)

where E denotes the expected value, and z_θ(x_t, t) is the network’s prediction of the noise at time step t.
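Eqs. (5.1)–(5.2) admit a well-known closed form, q(x_t | x₀) = N(√ᾱ_t x₀, (1 − ᾱ_t)I) with ᾱ_t = ∏_{s≤t}(1 − β_s), which is what makes training efficient: any noising step t can be sampled directly from x₀ instead of iterating the chain. A numpy sketch of this closed form (my own code):

```python
import numpy as np

def sample_xt(x0, betas, t, eps):
    # Sample x_t from q(x_t | x_0) in closed form:
    #   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    # where alpha_bar_t = prod_{s<=t} (1 - beta_s) and eps ~ N(0, I).
    alpha_bar = np.cumprod(1.0 - np.asarray(betas, dtype=float))[t]
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
```

As β_t grows (or t increases), ᾱ_t shrinks towards zero and x_t approaches pure Gaussian noise, which is exactly the requirement placed on x_T above.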
While diffusion models have achieved impressive results in image generation tasks, they have a notable shortcoming: due to their complexity, substantial computational resources are required both for training and for performing inference (i.e., generating synthetic data) [13]. Therefore, as Esser et al. observe [309], work on making these stages of the life cycle of diffusion models less computationally demanding has recently become a growing area of research. One of the most notable recent studies on that topic was conducted by Rombach et al. [13], who replaced the computationally expensive pixel space that diffusion models typically operate on with a more compact, lower-dimensional space that is perceptually equivalent to the original one. To achieve this, the training was divided into two distinct phases: first, an autoencoder was trained on the original data, removing imperceptible details from the data space and producing its low-dimensional latent representation; second, a diffusion model was trained on this latent space. More formally, the encoder $E$ encodes an image $x \in \mathbb{R}^{H \times W \times 3}$ into a latent representation $z = E(x)$, and the decoder $D$ subsequently reconstructs the image from $z$, returning $\tilde{x} = D(z) = D(E(x))$, where $z \in \mathbb{R}^{h \times w \times c}$ [13]. $E$ downsamples the input image by a factor $f = H/h = W/w$ [13]; Rombach et al. [13] empirically found that a factor $f$ in the range $4 \le f \le 16$ strikes a good balance between computational efficiency on the one hand and the quality of the output of their latent diffusion model (LDM) on the other. Additionally, to parametrise the LDM with conditions and thereby enable text-to-image synthesis, they employed the Bidirectional Encoder Representations from Transformers (BERT) tokenizer [310] and a transformer [311], which infer a latent code that can then be mapped onto the backbone of the LDM, the U-Net CNN [279], by means of a cross-attention mechanism [13].
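For intuition, the effect of the downsampling factor $f$ on the size of the latent representation can be illustrated with a tiny helper; the values $f = 8$ and $c = 4$ latent channels below are illustrative assumptions in the spirit of [13], not figures quoted from it:

```python
def latent_shape(H, W, f=8, c=4):
    """Spatial size of the latent representation z for downsampling factor
    f = H/h = W/w. The defaults (f=8, c=4 latent channels) are illustrative
    assumptions, not parameters taken from the cited work."""
    assert H % f == 0 and W % f == 0, "image size must be divisible by f"
    return (H // f, W // f, c)

print(latent_shape(512, 512))  # (64, 64, 4)
```

Running the diffusion process on a 64 × 64 × 4 latent instead of a 512 × 512 × 3 image is what makes the LDM approach computationally tractable.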
Stable Diffusion [312] is an updated and more efficient version of the original implementation of the LDM [313]. It features a number of performance-improving differences over the original LDM; in particular, instead of the BERT tokenizer, it employs the Contrastive Language-Image Pre-training (CLIP) text encoder [314]. Remarkably, this open-source model can be installed locally and run in inference mode on a machine with a GPU that has at least 10 GB of VRAM [312], which makes it accessible to a broad range of users. Furthermore, the release of Stable Diffusion with a browser-based user interface [315] has made interacting with the model even easier and allows users to employ various techniques for generating images (e.g., image interpolation, image-to-image translation, and inpainting; details follow) and methods for extending the LDM. Therefore, I chose the implementation in [315] for the experiments on augmenting real-world OOC image datasets with synthetic images reported in the rest of the chapter.

5.3 Experiments on the initial OOC image dataset

In the first study, I conducted experiments on the initial dataset of OOC microscopy images. In the following, I describe the real-world OOC image dataset, the process of generating synthetic data to augment it, and the methodology and results of experiments with CNN-based classifiers.

5.3.1 The initial OOC image dataset

To obtain data for the design and validation of a CNN-based classifier, cells were cultivated in a custom-made OOC setup.
Several cell lines were used for this purpose: to create a gut-on-a-chip model, the Caco-2 (colorectal adenocarcinoma epithelial cells, HTB-37, ATCC, Manassas, VA, USA) and HUVEC (human umbilical vein endothelial cells, CRL-1730, ATCC) cell lines were used to mimic epithelial and endothelial cell layers; for developing a lung cancer-on-a-chip model, A549 (human lung adenocarcinoma alveolar basal epithelial cells, CCL-185, ATCC) and HPMEC (human pulmonary microvascular endothelial cells; 3000, ScienCell, Carlsbad, CA, USA) were used; for lung-on-a-chip modelling, the HSAEC (human small airway epithelial cells, PCS-301-010, ATCC) epithelial cell line was used. The cell seeding density varied and was specific to each cell type to achieve optimal attachment to the membrane of the chip channel and its coverage. Cells were typically cultivated in the OOC setup for ≈ 8 days, although some chips were cultivated for up to 22 days. Flow rates ranging from 0.7 µL/min to 2.77 µL/min were used for cell cultivation. For imaging the cell tissue, an automated OOC brightfield microscopy setup developed by Cellbox Labs Ltd. (Riga, Latvia) was used. The imaging system consisted of a high-resolution IM Compact M camera (IC10-05q32MU3101, Opto GmbH), and precise control of chip movement was achieved using a precision XYZ motion system. Live imaging and image acquisition were done using OptoViewer software (Opto GmbH), while the movement of the XYZ stage was controlled with Zaber Launcher software (Zaber Technologies). The acquired images were saved in a private repository and indexed in a table containing a unique identifier for each image, along with data on initial cell seeding density, media flow rate, and cultivation length.
The resulting initial dataset of OOC images was rather small, consisting of 822 images, which were labelled into three classes with an imbalanced distribution: the 'good' class with 500 images, the 'acceptable' class with 212 images, and the 'bad' class with 110 images. The ground truth classification of the images was performed by experienced cell biologists based on expected cell morphology and density, while also taking into account the corresponding cell line and the duration of cultivation. The distribution of the images by classes and cell types is shown in Figure 5.3.

Figure 5.3: Distribution of images by cell lines and classes in the initial OOC dataset.

5.3.2 Synthetic data for augmenting the initial OOC image dataset

To augment the initial OOC image dataset with synthetic data, I fine-tuned Stable Diffusion using the method of low-rank adaptation (LoRA; [293]). A remarkable advantage of LoRA is that even a small number of images – just a few dozen – is typically sufficient for fine-tuning a large generative model. Fine-tuning the Stable Diffusion model with LoRA targets the cross-attention layers by incorporating low-rank matrices (hence the name of the method) that adjust a subset of the weights in these layers. In particular, after fine-tuning, the original weight matrix $W \in \mathbb{R}^{d \times k}$ is transformed into its updated version $W' = W + \Delta W$, where $\Delta W$ is the update matrix. The efficiency of fine-tuning with LoRA is due to the low rank of the update matrix, which is achieved by the rank decomposition $\Delta W = B \times A$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$ [293]. Since $W$ is frozen during fine-tuning, and only $A$ and $B$ are trained, and since the size of $A$ and $B$ is much smaller than that of $W$, LoRA is very efficient.
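The parameter saving behind the decomposition $\Delta W = B \times A$ can be sketched in a few lines of NumPy; the layer size $d = k = 768$ and rank $r = 8$ below are illustrative assumptions, not the values used in the actual fine-tuning:

```python
import numpy as np

d, k, r = 768, 768, 8                  # layer size and LoRA rank (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))        # frozen pretrained weight, never updated
B = np.zeros((d, r))                   # LoRA init: B = 0, so delta_W starts at 0
A = rng.standard_normal((r, k)) * 0.01 # only A and B are trained

delta_W = B @ A                        # rank-r update: rank(delta_W) <= r
W_prime = W + delta_W                  # W' = W + B A

full = d * k                           # trainable parameters without LoRA
lora = d * r + r * k                   # trainable parameters with LoRA
print(full, lora)                      # 589824 12288, i.e. 48x fewer
```

With $B$ initialised to zero, $W' = W$ at the start of fine-tuning, so the adapted model initially behaves exactly like the pretrained one and only gradually departs from it as $A$ and $B$ are trained.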
Thus, while the original Stable Diffusion-v-1-1 model was trained on a cluster consisting of 32 nodes with 8 NVIDIA A100 GPUs per node², it is possible to fine-tune it with LoRA using the free version of the Google Colab environment³. The fine-tuned model can then be used for generating images in response to text prompts provided by the user. The procedure for generating synthetic data involved splitting the dataset of real-world images into 5 folds (see Section 5.3.3) and, for each cross-validation round, creating three LoRA models (i.e., one for each of the classes 'good', 'acceptable', and 'bad') on the corresponding training folds, using a popular open-source implementation⁴ of LoRA. All in all, 15 LoRA models were generated with the following parameters: 2 repeats for each image, training for 10 epochs, a training batch size of two, a U-Net learning rate of 0.0005, and a text encoder learning rate of 0.0001. These LoRA models were then used together with the Stable Diffusion web UI model [315] to generate synthetic data. The Stable Diffusion web UI model was used with the following parameters: the Euler A sampler, 20 sampling steps, a classifier-free guidance (CFG) scale of 7, and a random seed. I generated two datasets of synthetic images with these parameters, the difference between them being that for one dataset, I set the LoRA weight to 1.0, whereas for the other dataset, I set this parameter to 0.8. A higher LoRA weight results in generated images that are more similar to the original ones, whereas a lower weight increases the variability of the generated images, making them less similar to the original ones. The creation of these two distinct datasets made it possible to explore the impact of different LoRA weights on the performance of the image classifier trained on the augmented dataset.

² https://huggingface.co/CompVis/stable-diffusion-v-1-1-original#training; accessed 29 September 2024.
³ https://colab.research.google.com/; accessed 29 September 2024.
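The bookkeeping behind the 15 per-round, per-class LoRA models can be sketched as follows; fold and class names are illustrative, and the point of the sketch is that each holdout fold is excluded from the data behind the LoRA models used in its cross-validation round:

```python
from itertools import product

folds = [f"fold{i}" for i in range(1, 6)]
classes = ["good", "acceptable", "bad"]

# For each cross-validation round, one LoRA model per class is created from
# the four training folds only, so no image from the holdout fold can
# influence the synthetic data used to augment that round's training set.
plan = {
    (holdout, cls): [f for f in folds if f != holdout]
    for holdout, cls in product(folds, classes)
}

print(len(plan))  # 15 LoRA models: 5 rounds x 3 classes
```

This mapping makes the leakage-prevention constraint explicit: the training folds listed for a given `(holdout, class)` pair never include the holdout fold itself.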
The examples of the generated images are shown in Figure 5.4. Fine-tuning Stable Diffusion with LoRA made it possible to generate images that are somewhat similar to their authentic counterparts. In particular, the granularity of both the authentic and synthetic 'good' images in the left column differs from that of the 'bad' images in the right column, which corresponds to the presence of cells attached to the medium they were grown on in the former case, compared to cells that have not attached to the medium in the latter case. However, due to the rather specific nature of the OOC images, it was unclear from visual inspection alone whether the degree of similarity would be sufficient to improve the accuracy of a CNN-based classifier trained on these data.

5.3.3 Experiments with CNNs on the initial dataset

I conducted experiments by training the EfficientNet-B7 [107] CNN model, available in the Keras framework [78], which was pretrained on the ImageNet [19] dataset. The architecture of the model was as follows:

• input layer with a resolution of 600 × 600 pixels;
• data augmentation layer with random rotations (factor=0.25), random translations (height factor=0.1, width factor=0.1), random flips, and random contrast (factor=0.1);
• baseline EfficientNet-B7 model with its weights frozen;
• GlobalAveragePooling2D layer;
• BatchNormalization layer with dropout (rate=0.2);
• Dense layer with 3 neurons and the softmax activation function.

The training of each model consisted of 30 epochs, using the Adam optimiser (learning rate=0.001), with sparse categorical crossentropy as the loss function. The dataset was partitioned into five folds, with each fold serving in turn as the holdout fold for 5-fold cross-validation.

⁴ https://civitai.com/models/22530; accessed 1 September 2024.

Figure 5.4: Sample OOC images. Top row – real-world images from the initial OOC image dataset: (a) class 'good'; (b) class 'acceptable'; (c) class 'bad'.
Bottom row – synthetic images generated with Stable Diffusion fine-tuned with LoRA: (d) class 'good'; (e) class 'acceptable'; (f) class 'bad'.

When training on the augmented dataset, I ensured that synthetic data was generated using LoRA models created on only the training data, excluding the data from the respective holdout fold, to prevent information leakage from the training data to the validation data. The training was conducted on a PC with 8 GB RAM, an Intel i5-2500K CPU, an NVIDIA RTX 3090 GPU with 24 GB VRAM, and Ubuntu 18.04.6 LTS OS. The results of experiments on datasets augmented with synthetic data generated using a LoRA weight of 1.0 are reported in Table 5.1, while those using a LoRA weight of 0.8 are reported in Table 5.2. Furthermore, the results of the experiments with the model trained solely on the real-world data are also provided in both tables for reference. As can be seen, the baseline CNN model achieved an accuracy of 72.9%, which is better than the accuracy of a putative 'naive' classifier on the given dataset, namely 60.8%, corresponding to the percentage of the largest class. However, augmenting the real-world image dataset with synthetic images resulted in a deterioration rather than an improvement in classification accuracy, with a trend of decreasing accuracy as the percentage of synthetic data used for augmentation increased.

5.4 Experiments on the final OOC image dataset

In the second study, the team of EDI researchers that I led conducted experiments by training CNNs on the final dataset of OOC microscopy images. Below, I describe the dataset, the

Table 5.1: The results of evaluating EfficientNet-B7 image classifiers trained on the datasets augmented with the synthetic data generated with the LoRA weight of 1.0.
Dataset                        Precision   Recall   Accuracy
Synthetic only (100%)             61.7      60.8      61.4
Real-world & 100% synthetic       70.7      68.7      69.9
Real-world & 75% synthetic        70.2      68.0      69.6
Real-world & 50% synthetic        72.0      69.7      71.0
Real-world & 25% synthetic        71.9      69.9      70.7
Real-world & 10% synthetic        72.9      70.1      72.1
Baseline (real-world only)        73.1      71.5      72.9

Table 5.2: The results of evaluating EfficientNet-B7 image classifiers trained on the datasets augmented with the synthetic data generated with the LoRA weight of 0.8.

Dataset                        Precision   Recall   Accuracy
Synthetic only (100%)             63.0      59.5      62.1
Real-world & 100% synthetic       70.9      68.5      70.1
Real-world & 75% synthetic        71.8      69.6      70.7
Real-world & 50% synthetic        69.8      67.9      69.3
Real-world & 25% synthetic        70.9      69.0      70.4
Real-world & 10% synthetic        72.9      70.2      71.8
Baseline (real-world only)        73.1      71.5      72.9

methodology of generating synthetic data to augment it, and the methodology and results of the experiments with CNN classifiers.

5.4.1 The final OOC image dataset

The final dataset of OOC images consists of 3072 images, incorporating the initial dataset described in Section 5.3.1, and was acquired using the same methodology as the initial dataset. However, it also features some key differences, namely:

• based on the recommendation of the biology experts involved in the AImOOC project, the three-class labelling ('good', 'acceptable', and 'bad') was replaced by a straightforward binary classification: 'good' and 'bad';
• in addition to the five cell lines in the initial dataset, the final dataset includes an additional line: NHBE (normal human bronchial epithelial cells, CC-2541, Lonza, Basel, Switzerland).

The distribution of the images by classes and cell types is shown in Figure 5.5.

Figure 5.5: Distribution of images by cell lines and classes in the final OOC dataset.
5.4.2 Synthetic data for augmenting the final OOC image dataset

To augment the final OOC image dataset with synthetic data, the research team that I led employed several methods: image-to-image translation, inpainting with masks, image interpolation, and fine-tuning with LoRA. I detail these methods below.

Image-to-image translation transforms an input image into an output image that both retains some features of the input and acquires some new ones. Using a latent diffusion model, the input image is first encoded into a latent representation that captures its essential features. This representation is then modified to some extent and decoded back into the image space, thus producing an output image. Leveraging the Stable Diffusion model for image-to-image translation, the original images were processed for 30 sampling steps using the Diffusion Probabilistic Model Second-Order Multistep Improved (DPM++ 2M) Karras sampler [316], a CFG scale of 7, and a denoising strength of 0.3. As a result, a counterpart was generated for each original image. Sample images obtained through this method are shown in Figure 5.6.

Figure 5.6: Examples of image-to-image translation with Stable Diffusion: (a) a real-world input image, class 'good'; (b) an output image, class 'good'; (c) a real-world input image, class 'bad'; (d) an output image, class 'bad'.

Inpainting with masks involves selectively modifying an input image by applying a mask that designates which parts of the image should be altered. In this study, the masks consisted of black and white stripes, with white stripes designating the areas to be inpainted (i.e., replaced) with synthetic data, and black stripes designating the areas to remain unchanged. As a result, a partially new image was produced.
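A striped mask of the kind described above can be produced with a few lines of NumPy; the 64-pixel stripe width and the 512 × 512 size are illustrative assumptions, as the actual mask dimensions are not specified here:

```python
import numpy as np

def striped_mask(h, w, stripe=64, vertical=True):
    """Black-and-white striped inpainting mask: white (255) stripes mark the
    regions to be replaced with synthetic content, black (0) stripes mark the
    regions to be kept. Stripe width and image size are illustrative."""
    mask = np.zeros((h, w), dtype=np.uint8)
    if vertical:
        for x in range(0, w, 2 * stripe):
            mask[:, x:x + stripe] = 255
    else:
        for y in range(0, h, 2 * stripe):
            mask[y:y + stripe, :] = 255
    return mask

m = striped_mask(512, 512)
print(m.mean() / 255)  # 0.5 -> exactly half of the image is inpainted
```

Alternating equal-width stripes keep the inpainted and preserved areas balanced, so the output image remains anchored to the real input while still receiving substantial synthetic content.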
To implement inpainting with masks using the Stable Diffusion model, the input images were processed for 50 sampling steps with the DPM++ 2M Karras sampler, a CFG scale of 7, and a denoising strength of 0.3. The resulting sample images are shown in Figure 5.7.

Figure 5.7: Example of inpainting with masks with the Stable Diffusion model. Top row: (a) a real-world input image; (b) a vertical mask; (c) an output image. Bottom row: close-ups of crops (100 × 100 pixels) from the top left corners of the same (d) input image and (e) output image, illustrating the difference between them due to inpainting.

The goal of image interpolation is to create an intermediate image between two or more input images. In the case of Stable Diffusion, interpolation is performed in the latent space; in particular, since the latent space learned by Stable Diffusion is a continuous manifold, it is possible to move along the path connecting two or more images while remaining on the manifold, and each intermediate step on that path will therefore be a valid image as well [317]. For the implementation of image interpolation in this study, Stable Diffusion was extended with a publicly available script⁵, and the model processed the input images for 20 steps using the DPM++ 2M Karras sampler with the denoising strength set to 0.25, allowing the output images to preserve salient details without introducing noise artefacts. The CFG scale was fixed at the default value of 7, and no prompt was used. Sample images obtained through interpolation are shown in Figure 5.8.

⁵ https://github.com/DiceOwl/StableDiffusionStuff/blob/main/interpolate.py; accessed 10 August 2024.

Figure 5.8: Example of interpolation with Stable Diffusion: (a) the first input image; (b) the second input image; (c) the output image.

For image generation after fine-tuning Stable Diffusion with LoRA, two LoRA models were created: one for the 'bad' class, and one for the 'good' class.
The parameters for creating the LoRA models were as follows: two repeats for each image, training for 10 epochs, a training batch size of two, a U-Net learning rate of 0.0005, and a text encoder learning rate of 0.0001. The LoRA models were then used as fine-tuning extensions of the web UI implementation of Stable Diffusion to generate synthetic images with the following parameters: the DPM++ 2M Karras sampler, 20 sampling steps, a CFG scale of 7, a random seed, and a LoRA weight of 0.7. Overall, these parameters were similar to those used in the study on the initial dataset reported in Section 5.3.2, the main difference being that in the study on the final dataset, the more advanced DPM++ 2M Karras sampler was used instead of the default Euler A sampler. Sample images generated by means of this approach are shown in Figure 5.9.

Figure 5.9: Examples of images generated after fine-tuning Stable Diffusion with LoRA: the top row – class 'good', the bottom row – class 'bad'.

5.4.3 Experiments with CNNs on the final dataset

Prior to training CNNs on the dataset, all images were standardised by centre-cropping from their original size of either 2056 × 1542 or 2048 × 1536 pixels to a uniform size of 512 × 512 pixels. This cropping met the input requirements of the pretrained CNN models; furthermore, focusing on the central areas of the images eliminated redundant edge regions, which often contained repetitive patterns and peripheral anomalies such as black lines and blurriness that did not contribute to model training. After cropping, the dataset was split into training, validation, and test subsets with a ratio of 70/10/20, ensuring that all classes, cell lines, and time points after seeding (0-1 days, 2-3 days, 4 days, and 4+ days) were proportionally represented in each split.
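The proportional 70/10/20 split described above can be sketched as a simple stratified procedure; the code below is a simplified pure-Python version with an illustrative stratum definition and toy data, and the real implementation may differ:

```python
import random
from collections import defaultdict

def stratified_split(items, pct=(70, 10, 20), seed=0):
    """Split (id, stratum) pairs into train/val/test so that each stratum
    (e.g. a class/cell-line/time-bucket combination) is represented
    proportionally in every subset. Simplified sketch of a 70/10/20 split."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item_id, stratum in items:
        by_stratum[stratum].append(item_id)
    train, val, test = [], [], []
    for ids in by_stratum.values():
        rng.shuffle(ids)
        n = len(ids)
        n_train = n * pct[0] // 100   # integer arithmetic avoids float drift
        n_val = n * pct[1] // 100
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]
    return train, val, test

# Toy data: 300 items spread evenly over 6 strata (2 classes x 3 cell lines)
items = [(i, ("good" if i % 2 else "bad", f"line{i % 3}")) for i in range(300)]
train, val, test = stratified_split(items)
print(len(train), len(val), len(test))  # 210 30 60
```

Splitting within each stratum, rather than over the whole pool, is what guarantees that rare combinations of class, cell line, and cultivation time appear in all three subsets.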
After preprocessing the images and splitting the dataset, the team that I led conducted experiments with two CNN models provided in the Keras framework [78]: EfficientNet-B7 [107] and MobileNetV3Large [106]. Both models were pretrained on the ImageNet [19] dataset. The architecture of the models was as follows:

• input layer with a resolution of 600 × 600 pixels;
• data augmentation layer with random rotations (factor=0.25), random translations (height factor=0.1, width factor=0.1), random flips, and random contrast (factor=0.1);
• baseline model: either EfficientNet-B7 or MobileNetV3Large, with the weights of the foundational model initially frozen;
• GlobalAveragePooling2D layer;
• BatchNormalization layer with dropout (rate=0.2);
• Dense layer with a single neuron and a sigmoid activation function for binary classification.

Training of EfficientNet-B7 was divided into two stages. First, the model was trained for 30 epochs with the weights of the foundational EfficientNet-B7 model frozen to preserve the features learned during pretraining on ImageNet. Afterwards, the model was fine-tuned: the top 20 layers were unfrozen, and the model was trained for an additional 30 epochs. During both stages, the Adam optimiser was used with a learning rate of 0.0001 and a decay rate of 0.0001. Training of MobileNetV3Large began with 30 epochs with the weights of the foundational model frozen; the Adam optimiser was used with a learning rate of 0.0001 and a decay rate of 0.0001. After this initial phase, fine-tuning involved unfreezing the last 15 layers and reducing the learning rate to 0.00001 to prevent overfitting. The model was then trained for an additional 170 epochs with an early stopping callback, halting training if there was no improvement in validation accuracy over 30 consecutive epochs.
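The early-stopping rule used for MobileNetV3Large (a patience of 30 epochs on validation accuracy) can be made concrete with a small sketch; this is a simplified stand-in for the Keras EarlyStopping callback, not the actual training code:

```python
def early_stopping_run(val_accuracies, patience=30):
    """Return the number of epochs actually trained under the rule described
    above: stop once validation accuracy has not improved for `patience`
    consecutive epochs (simplified sketch of an early-stopping callback)."""
    best, since_best = -1.0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_accuracies)

# Toy curve: accuracy improves for 40 epochs, then plateaus,
# so training halts exactly 30 epochs after the last improvement.
curve = [0.5 + 0.005 * min(e, 40) for e in range(1, 171)]
print(early_stopping_run(curve))  # 70
```

In practice this means the nominal 170 fine-tuning epochs act only as an upper bound; training usually stops much earlier once the validation curve flattens.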
Results

To test Hypothesis 1 and obtain baseline results for further performance comparison, both models, EfficientNet-B7 and MobileNetV3Large, were first trained on the original dataset of real-world OOC microscopy images without synthetic augmentation. The results of the experiments are provided in Table 5.3. EfficientNet-B7 achieved an accuracy of 83%, while MobileNetV3Large achieved an accuracy of 81%. Both results are better than the accuracy of a putative 'naive' classifier, which corresponds to the share of the largest class (56%). The superior performance of EfficientNet-B7 compared to MobileNetV3Large can be attributed to the larger model's better ability to generalise.

Table 5.3: Classification results for the EfficientNet-B7 and MobileNetV3Large models trained on the real-world images.

Model              Precision   Recall   Accuracy
EfficientNet-B7       83         77        83
MobileNetV3Large      79         78        81

The results of experiments on the dataset augmented with synthetic images generated by image-to-image translation are shown in Table 5.4. As can be seen, the highest accuracy for EfficientNet-B7 was achieved by the model trained on the dataset augmented with 100% of the synthetic data, whereas the highest accuracy for MobileNetV3Large was achieved by the model trained on the dataset augmented with 25% of the synthetic data.

Table 5.4: Classification results for the EfficientNet-B7 and MobileNetV3Large models trained on the dataset augmented with synthetic data generated by image-to-image translation.
Model              Dataset                      Precision   Recall   Accuracy
EfficientNet-B7    Real-world & 100% synth         82          84        85
                   Real-world & 75% synth          81          77        82
                   Real-world & 50% synth          84          77        83
                   Real-world & 25% synth          84          75        83
                   Baseline (real-world only)      83          77        83
MobileNetV3Large   Real-world & 100% synth         79          78        81
                   Real-world & 75% synth          78          79        81
                   Real-world & 50% synth          76          83        81
                   Real-world & 25% synth          80          79        82
                   Baseline (real-world only)      79          78        81

The results of experiments on the dataset augmented with synthetic images generated by inpainting are shown in Table 5.5. The highest accuracy for EfficientNet-B7 was achieved by the model trained on the dataset augmented with 100% of the synthetic data, whereas the highest accuracy for MobileNetV3Large was achieved by the models trained on the datasets augmented with 100% and 75% of the synthetic data.

Table 5.5: Classification results for the EfficientNet-B7 and MobileNetV3Large models trained on the dataset augmented with synthetic data generated by inpainting.

Model              Dataset                      Precision   Recall   Accuracy
EfficientNet-B7    Real-world & 100% synth         84          79        84
                   Real-world & 75% synth          82          80        83
                   Real-world & 50% synth          88          68        82
                   Real-world & 25% synth          84          68        80
                   Baseline (real-world only)      83          77        83
MobileNetV3Large   Real-world & 100% synth         78          83        82
                   Real-world & 75% synth          79          80        82
                   Real-world & 50% synth          78          81        81
                   Real-world & 25% synth          80          75        80
                   Baseline (real-world only)      79          78        81

The results of the experiments on the dataset augmented with synthetic images generated by interpolation are provided in Table 5.6. The highest accuracy for EfficientNet-B7 was achieved by the model trained only on the real-world images, whereas the highest accuracy for MobileNetV3Large was achieved by the model trained on the dataset augmented with 25% of the synthetic data.

Table 5.6: Classification results for the EfficientNet-B7 and MobileNetV3Large models trained on the dataset augmented with synthetic data generated by interpolation.
Model              Dataset                      Precision   Recall   Accuracy
EfficientNet-B7    Real-world & 100% synth         87          66        80
                   Real-world & 75% synth          87          68        81
                   Real-world & 50% synth          85          67        80
                   Real-world & 25% synth          83          71        81
                   Baseline (real-world only)      83          77        83
MobileNetV3Large   Real-world & 100% synth         77          76        79
                   Real-world & 75% synth          75          81        79
                   Real-world & 50% synth          80          75        80
                   Real-world & 25% synth          80          79        82
                   Baseline (real-world only)      79          78        81

Finally, Table 5.7 summarises the results of experiments on the dataset augmented with synthetic images generated after fine-tuning Stable Diffusion with LoRA. The results indicate that augmentation with these synthetic images did not improve the accuracy of EfficientNet-B7 or MobileNetV3Large, as none of the models trained on the augmented data performed better than the baseline models.

Table 5.7: Classification results for the EfficientNet-B7 and MobileNetV3Large models trained on the dataset augmented with synthetic data generated after fine-tuning Stable Diffusion with LoRA.

Model              Dataset                      Precision   Recall   Accuracy
EfficientNet-B7    Real-world & 100% synth         86          70        82
                   Real-world & 75% synth          82          75        78
                   Real-world & 50% synth          82          79        83
                   Real-world & 25% synth          84          78        81
                   Baseline (real-world only)      83          77        83
MobileNetV3Large   Real-world & 100% synth         75          76        76
                   Real-world & 75% synth          78          78        77
                   Real-world & 50% synth          78          82        80
                   Real-world & 25% synth          77          84        80
                   Baseline (real-world only)      79          78        81

5.5 Concluding remarks

One of the main practical obstacles to the further advancement of the promising OOC technology is the need for round-the-clock monitoring of tissue samples growing on chips, which is currently done by humans. The goal of the research reported in this chapter was to develop a CNN-based image classifier for assessing the quality of such samples, which would enable the automation of the monitoring in the future.
To achieve this goal, I conducted two experimental studies: one on the initial OOC microscopy image dataset consisting of 822 images that represented five cell lines and were labelled as 'good', 'acceptable', or 'bad', and another on the final dataset consisting of 3072 images that represented six cell lines and were labelled as 'good' or 'bad'. In both studies, the same two hypotheses were tested:

• Hypothesis 1: A CNN-based classifier achieves better accuracy on the real-world microscopy OOC image dataset than a putative 'naive' classifier.
• Hypothesis 2: The classification accuracy on the real-world microscopy OOC image dataset improves when a CNN-based classifier is trained on the dataset augmented with synthetic data generated with the Stable Diffusion model rather than solely on the real-world image dataset.

The first hypothesis was confirmed in both studies. In particular:

• in the first study, the EfficientNet-B7 model achieved an accuracy of 72.9%, whereas the accuracy of the putative 'naive' classifier on the initial dataset was estimated at 60.8%;
• in the second study, the EfficientNet-B7 model achieved an accuracy of 83%, and the MobileNetV3Large model achieved an accuracy of 81%, whereas the accuracy of the putative 'naive' classifier on the final dataset was estimated at 56%.

These results demonstrate the effectiveness of the selected CNN models for OOC image classification. The difference between the results achieved by EfficientNet-B7 in the first study (on the initial dataset) and in the second study (on the final dataset) is likely due to two major factors: the increase in the size of the training dataset, and, perhaps even more importantly, the shift from the more challenging three-class classification task to the simpler binary classification task.
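The 'naive' baseline quoted above follows directly from the class counts of the initial dataset given in Section 5.3.1:

```python
# Class counts from the initial OOC dataset (Section 5.3.1)
counts = {"good": 500, "acceptable": 212, "bad": 110}
total = sum(counts.values())                   # 822 images
naive_accuracy = max(counts.values()) / total  # always predict the largest class
print(f"{100 * naive_accuracy:.1f}%")          # 60.8%
```

The same reasoning applied to the final dataset, where the largest class holds 56% of the images, yields the 56% baseline used in the second study.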
Another noteworthy observation is that EfficientNet-B7 outperformed MobileNetV3Large on the final dataset, likely due to the higher learning capacity of the larger network (66.7 vs 5.4 million parameters⁶). However, the gap between the accuracy of the models is just 2%, which is substantially smaller than, e.g., the gap between the Top-1 accuracy of these models on the benchmark dataset ImageNet: 84.3% for EfficientNet-B7 vs 75.6% for MobileNetV3Large⁷. Therefore, for practical purposes, it may be more expedient to use MobileNetV3Large, as its smaller size allows it to run in inference mode on a mobile or edge device, which may be more suitable for a real-life OOC setup in lab conditions than using a PC. The second hypothesis was not confirmed in the first study, as augmenting the real-world dataset with synthetic images led to a deterioration rather than an improvement in classifier performance for both synthetic datasets, that is, for the datasets with LoRA weight = 1.0 and LoRA weight = 0.8. The negative impact of the synthetic data is particularly evident when the model is trained solely on synthetic data, as these models exhibit the worst accuracy. Remarkably, the synthetic data generated with Stable Diffusion look similar to the original data (see Figure 5.4), and the model trained solely on them converges very well during training. However, due to a likely discrepancy between the distributions of the real-world and synthetic data, augmenting the former with the latter did not yield any benefits. In the second study, the second hypothesis was confirmed for augmentation with synthetic data generated by means of image-to-image translation and inpainting, but not for augmentation with synthetic data generated using LoRA. In the case of synthetic data generated by interpolation, a minor improvement on the augmented dataset was achieved with the MobileNetV3Large model, but not with the EfficientNet-B7 model.
These results indicate that the effectiveness of augmenting real-world biomedical image datasets with synthetic data may vary substantially depending on the specific method of data generation. Finally, I would like to observe that, similarly to the study reported in Chapter 3, the performance improvement of the models that benefited from training on augmented datasets did not directly correlate with the amount of synthetic data used for augmentation. For instance, in the case of the image-to-image approach to synthetic data generation, the best-performing EfficientNet-B7 model was trained on the dataset augmented with all available (100%) synthetic data, while the best-performing MobileNetV3Large model was trained on the dataset augmented with only 25% of the available synthetic data. For the inpainting approach, the best-performing EfficientNet-B7 model was trained on the dataset augmented with all available (100%) synthetic data, whereas the MobileNetV3Large models that demonstrated the best accuracy were trained on the datasets augmented with 100% and 75% of the available synthetic data. Therefore, consistent with the conclusions of Chapter 3, I note that the relationship between the amount of data used for augmentation and the performance of the image understanding models trained on the augmented datasets is far from straightforward.

⁶ Cf. https://keras.io/api/applications/ and https://keras.io/api/applications/mobilenet/#mobilenetv2-function; both accessed 2 September 2024.
⁷ https://keras.io/api/applications/; accessed 2 September 2024.

Conclusion

This PhD thesis was concerned with the application of computer vision methods to three major image understanding tasks: image classification, object detection, and semantic segmentation.
Specifically, I used computer vision methods to solve the following real-world problems:

• classification of hand-washing movements in a clinical setting to automate the monitoring of compliance of medical personnel with hand hygiene standards (Chapter 2);
• semantic segmentation of street views to enhance the perception modules of self-driving cars (Chapter 3);
• detection of plastic bottles that can be picked up by a robotic arm to automate the production line in a manufacturing facility (Chapter 4);
• classification of microscopy images to automate the monitoring of the growth of organs-on-a-chip (Chapter 5).

As I outlined in my overview of the highlights of computer vision in Chapter 1, both the methods and the software tools in this field have developed tremendously since its emergence in the 1960s, and as a result of that growth, computers nowadays even outperform humans on some visual recognition tasks. However, since computers still do not understand the visual environment as naturally as we humans do, each particular image understanding task still has to be solved on a case-by-case basis: choosing an appropriate approach, whether classical computer vision methods or the deep learning paradigm; deciding which specific model to use; finding a dataset or collecting and labelling one on one's own; preprocessing the data; and training and evaluating the model. When I followed these steps in the studies that laid the foundation of this thesis, my work was substantially facilitated by prior theoretical and practical advancements in the fields of computer vision and machine learning. In particular, instead of having to hand-engineer features for my models, as was customary in the era of the predominance of classical methods, I adopted CNNs as a single approach for solving various image understanding tasks, from image classification to object detection to semantic segmentation.
While the specifics of the model architectures varied, the fundamental advantage of CNNs, their ability to automatically learn hierarchical feature representations directly from data, motivated the central premise of this thesis: that convolutional neural networks can successfully solve the image understanding tasks considered here. Rather than having to implement and train CNNs from scratch, I had at my disposal publicly available CNN models implemented in deep learning frameworks: TensorFlow, Keras, and PyTorch. Moreover, the models were available not as bare architectures but with weights pretrained on large general-purpose datasets such as ImageNet and MS COCO, enabling transfer learning and fine-tuning of the models on the much smaller datasets that I was working with. For some of the tasks, I was also able to utilise suitable publicly available datasets: the Cityscapes dataset was essential for my work on semantic segmentation of street views, while the Kaggle dataset allowed me to establish a baseline for assessing the performance of CNN-based hand-washing movement classifiers. Furthermore, I leveraged open-source assets for generating synthetic data that I could then use for augmenting real-world datasets: thus, I generated street views with a high degree of photorealism, varying weather conditions, and a high number of salient objects such as cars and pedestrians with the CARLA simulator, and synthetic images of OOC cell tissue with complex textures using the Stable Diffusion generative model. However, despite the availability of assets for implementing and training CNNs, I still had to overcome substantial challenges. One of these was the notoriously fluid8 state of many state-of-the-art deep learning libraries: due to the very rapid pace and competitiveness of deep learning research, many researchers prefer to publish new studies and release new libraries rather than ensure the stability of already released code and document it properly.
As a result, when working with cutting-edge libraries rather than with the more stable mainstream deep learning frameworks, I often had to invest considerable time in debugging the code and looking for answers on various forums rather than in documentation, which was often insufficiently detailed. Furthermore, since deep learning is an experimental field of science, the efficacy of a bug fix or of a solution suggested in non-published sources could often be established only after conducting a time-consuming experiment with the model, and many times the outcome was that the debugging or the search for a solution had to continue. Another, more fundamental challenge was that, in general, the decisions regarding the architecture and hyperparameters of deep learning models that have to be made when applying CNNs to real-world problems remain largely heuristic. While many such choices are well motivated – for instance, since it is known that a larger model such as EfficientNet-B7 is likely to generalise better than a smaller one such as EfficientNet-B0, one may choose the former over the latter when classifying a complex dataset – many other choices reside in a gray area. Was unfreezing the top 20 layers of EfficientNet-B7 and the top 15 layers of MobileNetV3Large during the final training stage on the final OOC dataset (see Section 5.4.3) the optimal approach, or would it have been better to unfreeze fewer layers or, on the contrary, to unfreeze the models entirely? What about having 128 neurons in the Dense layer of the baseline models trained on the METC and PSKUS datasets: was that number just right, or too small, or too large? Could it be that arranging the same number of neurons in several layers rather than in a single one would lead to better accuracy?
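Such open choices form a combinatorial search space. A minimal sketch (with hypothetical axis values, not the settings actually explored in the thesis) shows how quickly even a modest grid grows:

```python
from itertools import product

# Hypothetical search axes mirroring the questions above; the specific
# values are illustrative only.
grid = {
    "unfrozen_layers": [5, 10, 15, 20, "all"],
    "dense_units": [32, 64, 128, 256],
    "dense_layers": [1, 2, 3],
    "learning_rate": [1e-3, 1e-4, 1e-5],
}

# Every combination of one value per axis is a separate configuration,
# and evaluating each one requires a full training run.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 5 * 4 * 3 * 3 = 180 full training runs
```

At several hours per training run, exhaustively evaluating even this toy grid would take on the order of weeks on a single GPU, which is one reason such choices remain largely heuristic in practice.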
This thesis does not provide answers to these questions, or to a myriad of other similar questions, since, as is usually the case with applications of deep learning to complex real-world datasets, a comprehensive grid search for the best model parameters was not feasible due to the limited computational resources, the long training times of the models, and the high number of hyperparameters and architectural features. Finally, yet another major challenge was the issue of data acquisition and labelling. In particular, since there were no publicly available datasets with labelled recordings of hand washing, it was necessary to collect and label such a dataset for developing a hand-washing movement classifier; and since the OOC image classifier that I needed to develop was supposed to process rather specific images generated with particular protocols on the particular OOC setup used in the AImOOC project, it was necessary to collect and process a dataset with such images as well. While I did not personally conduct the data acquisition and labelling for either dataset, I had to translate the requirements of the deep learning pipelines to the experts – epidemiologists and biomedical researchers, respectively – and make sure that these requirements were eventually met. Some of the main challenges of the data acquisition for the work reported in this thesis were inter-annotator agreement (actually, often disagreement) when labelling hand-washing videos and the rather small size of the initial OOC cell image dataset. While the latter issue was largely resolved – first by employing cross-validation of the models on several folds, and then by collecting and utilising a larger dataset – the labelling of the PSKUS dataset, the largest real-world dataset of hand-washing videos collected in the studies reported in Chapter 2, can still be considered rather noisy.

8I am afraid that some more orthodox Computer Science researchers would actually use the word ‘sloppy’ here.
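One simple way to quantify labelling noise of this kind is frame-level percent agreement between annotators; a minimal sketch, with hypothetical per-frame movement codes:

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of frames to which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotations must cover the same frames")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical per-frame hand-washing movement codes from two annotators.
annotator_1 = [1, 1, 2, 3, 3, 0, 4, 4, 2, 1]
annotator_2 = [1, 1, 2, 3, 2, 0, 4, 4, 2, 2]
print(percent_agreement(annotator_1, annotator_2))  # 0.8
```

Note that raw percent agreement does not correct for agreement occurring by chance; chance-corrected statistics such as Cohen's kappa give a stricter picture of labelling quality.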
In particular, since there was only a ≈ 91% agreement between annotators as to which frames should be labelled as ‘is washing’, and a further ≈ 90% agreement on the hand-washing movement code, training and evaluating classifiers on such noisy data is very likely to affect their performance negatively. In retrospect, it appears to me that the data labelling process should have been monitored more closely, so that the problem of poor inter-annotator agreement could have been identified and mitigated earlier. Another possible improvement would have been to use additional lighting and to film the videos with higher-resolution cameras. Despite the challenges that I outlined above, I consider the results of the studies that laid the foundation of this thesis to be rather successful; in particular:

• In the research on hand-washing movement classification, excellent results – an F1 score of 96% – were achieved on the Kaggle dataset, and satisfactory results – an F1 score of 64% – were achieved on the METC dataset. While none of the models achieved good performance on the most complex and noisy dataset, the PSKUS dataset, their failure highlighted the important yet often neglected problem of translating methodology from lab conditions to complex real-world conditions. Furthermore, despite the noise in the labelling of the PSKUS dataset, both it and METC, another dataset collected and published in open access in the course of the research reported in Chapter 2, are valuable assets for further studies on hand-washing movement classification.
• In the research on improving the accuracy of semantic segmentation of street views, the mIoU of MobileNetV2 trained on the CCM-50 dataset (i.e., the dataset augmented with 50% of the available synthetic data) was higher by 12% than that of the same model trained only on the real-world Cityscapes images, and the mIoU of Xception-65 trained on the CCM-25 dataset (i.e., the dataset augmented with 25% of the available synthetic data) was higher by 7% than that of the same model trained only on the real-world Cityscapes images. While it was not possible to predict which amount of synthetic data for augmentation would yield the best result, the fact that all models trained on the augmented datasets outperformed their counterparts trained solely on the real-world data demonstrated the usefulness of synthetic data for the enhancement of the perceptual modules of self-driving cars. Furthermore, it is worth noting that the synthetic data was generated in a relatively simple and straightforward manner by running simulations in the open-source CARLA simulator, which demonstrated that very useful synthetic data does not have to be challenging to acquire.

• In the research on detecting graspable bottles, the state-of-the-art object detector YOLOv5 demonstrated that it can efficiently – the best model achieved a mAP of 77.7% at a threshold of 0.5 – detect bottles with visibility above a certain threshold in a pile of similar-looking objects. This study also contributed to the topics of using synthetic data for training models and enhancing the photorealism of synthetic data, since the best-performing object detectors were trained on synthetic data enhanced with the CycleGAN model.

• In the research on OOC microscopy image classification, the best EfficientNet-B7 model achieved an accuracy of 85%, while the best MobileNetV3Large model achieved an accuracy of 82%.
Since these best-performing models were trained on the datasets augmented with 100% and 25% of the available synthetic data generated by image-to-image translation with Stable Diffusion, these studies also contributed to the emerging research on generating biomedical imagery for training DNNs with large generative models.

As follows from the above recap, the results demonstrated that the goal of the thesis – to provide efficient solutions for applied image understanding tasks – was achieved for all tasks except the classification experiments on the PSKUS dataset. The scientific novelty of the studies reported in the thesis is, inter alia, due to the fact that in those studies, CNN models were successfully applied to novel datasets: either real-world, or synthetic, or both. The findings of the studies, considered jointly, also lead me to propose the following thesis statements for the defence:

• Thesis statement one: In applications of CNNs to real-world image understanding tasks, data availability and quality present greater challenges than model selection and customisation. This thesis statement is grounded in the studies that I presented in Chapters 2, 3, 4, and 5. Additionally, it is supported by observations in the literature (see Section 1.4.1), which highlight that while the AI community has heavily focused on model development and improvement, making DNN models more accessible, the availability and quality of data remain the primary enabling factors for effectively utilising DNNs.

• Thesis statement two: CNNs that perform well when trained and evaluated on datasets acquired in laboratory conditions may struggle to achieve similar success when trained and evaluated on more complex real-world data. This thesis statement is grounded in the research on hand-washing movement classification that I reported in Chapter 2.
A major finding of the cross-dataset study [176] reported there was that CNNs that achieved high classification accuracy on the Kaggle dataset, which was acquired in simplified lab conditions, demonstrated substantially lower accuracy on the more complex METC dataset and failed to generalise on the even more challenging real-world PSKUS dataset9. There are compelling reasons to believe that the scope of this observation extends beyond this particular study, especially given the current focus in a substantial part of ML research on incremental improvements, often at the expense of generalisability (see Section 1.4.1).

• Thesis statement three: While state-of-the-art CNN-based image classifiers and object detectors with a larger number of parameters typically demonstrate higher accuracy on benchmark datasets than their counterparts with a smaller number of parameters, this accuracy gap narrows or even vanishes when these models are trained and evaluated on smaller, more complex real-world datasets. This thesis statement is grounded in the results of the experiments in Chapter 4, in which I used the YOLOv5 object detector to detect graspable bottles, and in Chapter 5, in which I used the EfficientNet-B7 and MobileNetV3Large models for classification on the final dataset of OOC microscopy images. The experiments in Chapter 4 were conducted with different sizes of the YOLOv5 architecture: Small (7.2 million parameters), Medium (21.2 million parameters), and Extra Large (86.7 million parameters).

9Note that while the cross-dataset study I refer to here included experiments with training a model on one dataset and evaluating it on another dataset – e.g., a model was trained on the Kaggle dataset and evaluated on the METC dataset – my observation here is primarily about training and evaluating a model on the same dataset.
When the authors of these models benchmarked them on the MS COCO dataset, the accuracy of object detection matched their size: the Small model achieved an mAP of 56.8%, the Medium model 64.1%, and the Extra Large model 68.9%10. However, when trained on the best dataset for the bin-picking task, Augmented noise 1024 × 768, the Small model achieved an mAP of 77.7%, the Medium model 75.3%, and the Extra Large model 76.1%. These results indicate that in these experiments, the advantage of the larger models vanished. As for the experiments in Chapter 5, EfficientNet-B7 is substantially larger than MobileNetV3Large – 66.7 vs 5.4 million parameters – which is also reflected in the performance on the benchmark dataset ImageNet: 84.3% for EfficientNet-B7 vs 75.6% for MobileNetV3Large11. However, that gap decreased to just 2% in favour of EfficientNet-B7 when these models were trained and evaluated on the OOC image dataset.

• Thesis statement four: While augmenting real-world datasets with photorealistic synthetic images is an efficient way to improve the accuracy of CNNs trained on such data, increasing the amount of synthetic data does not directly correlate with improved accuracy on image understanding tasks. This thesis statement is grounded in the research in Chapters 3 and 5. In Chapter 3, I used various amounts of synthetic data – 25%, 50%, and 100% of the available synthetic images – for augmenting Cityscapes, the real-world dataset of street views, to improve the accuracy of semantic segmentation.
While the accuracy of the MobileNetV2 and Xception-65 models trained on augmented datasets improved compared to the models trained only on the real-world images, the accuracy improvement did not correlate with the amount of synthetic data used for augmentation: the best-performing MobileNetV2 model was the one trained on the dataset augmented with 50% of the synthetic data, whereas the best-performing Xception-65 model was the one trained on the dataset augmented with 25% of the synthetic data. In Chapter 5, the most efficient approaches to generating synthetic data for augmenting real-world OOC image datasets for training the EfficientNet-B7 and MobileNetV3Large classifiers were image-to-image translation and inpainting. For image-to-image translation, the best-performing EfficientNet-B7 model was trained on the dataset augmented with 100% of the synthetic data, while the best-performing MobileNetV3Large model was trained on the dataset augmented with 25% of the synthetic data; for inpainting, the respective datasets were the ones augmented with 100% of the synthetic data and with 75% and 100% of the synthetic data. These results suggest that there is no straightforward correlation between the amount of data used for augmentation and the accuracy of a model trained on the augmented dataset.

While the results of the studies were published in six scholarly articles indexed in the Elsevier Scopus and/or Web of Science databases (with at least two more publications forthcoming) and in two scholarly publications not indexed in these databases, as well as presented at four conferences, there is still ample room for further work. The primary objective for the near future is to apply the insights and knowledge gained during the work on this thesis to solve image understanding tasks in other ongoing research projects. Thus, the object detection approaches used in the study in Chapter 4 and the methods for generating synthetic data with large generative models to augment real-world image datasets, developed and validated in the research reported in Chapter 5, are currently being applied and further improved in the project Holographic microscopy- and artificial intelligence-based digital pathology for the next generation of cytology in veterinary medicine – VetCyto to enhance the precision of microscopy-based veterinary diagnostics. The major goal for the more distant yet, I believe, still foreseeable future is to contribute to the development of models truly capable of image understanding.

10https://github.com/ultralytics/yolov5. Accessed 15 September 2024.
11Cf. https://keras.io/api/applications/ and https://keras.io/api/applications/mobilenet/#mobilenetv2-function. Both accessed 15 September 2024.

Bibliography

[1] N. J. Nilsson, The Quest for Artificial Intelligence. Cambridge University Press, 2009.
[2] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The bulletin of mathematical biophysics, vol. 5, pp. 115–133, 1943.
[3] J. Achiam et al., “GPT-4 technical report,” arXiv:2303.08774, 2023.
[4] D. Hassabis, “Artificial Intelligence: Chess match of the century,” Nature, vol. 544, no. 7651, pp. 413–414, 2017.
[5] R. Szeliski, Computer Vision: Algorithms and Applications. Springer Nature, 2022.
[6] M. A. Boden, Mind as Machine: A History of Cognitive Science. Oxford University Press, 2008.
[7] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, “A review of semantic segmentation using deep neural networks,” International journal of multimedia information retrieval, vol. 7, pp. 87–93, 2018.
[8] Y. Amit, P. Felzenszwalb, and R. Girshick, “Object detection,” in Computer Vision: A Reference Guide, pp. 1–9, Springer, 2020.
[9] S. Nikolenko, Synthetic Data for Deep Learning. Springer, 2021.
[10] W. Zhao, J. P. Queralta, and T. Westerlund, “Sim-to-real transfer in deep reinforcement learning for robotics: A survey,” in 2020 IEEE symposium series on computational intelligence (SSCI), pp.
737–744, IEEE, 2020.
[11] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Conference on robot learning, pp. 1–16, PMLR, 2017.
[12] I. Goodfellow et al., “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
[13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
[14] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[15] N. O’Mahony et al., “Deep learning vs. traditional computer vision,” in Advances in Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), Volume 1, pp. 128–144, Springer, 2020.
[16] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, pp. 273–297, 1995.
[17] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
[19] J. Deng et al., “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, IEEE, 2009.
[20] B. Jähne, H. Haussecker, and P. Geissler, Handbook of Computer Vision and Applications. San Diego: Academic Press, 1999.
[21] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[22] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002.
[23] X. Feng, Y. Jiang, X. Yang, M. Du, and X. Li, “Computer vision algorithms and hardware implementations: A survey,” Integration, vol. 69, pp. 309–320, 2019.
[24] S. A.
Papert, “The summer vision project,” tech. rep., MIT, 1966.
[25] B. Jähne, ed., Computer Vision and Applications: A Guide for Students and Practitioners. Elsevier, 2000.
[26] The Encyclopedia Britannica Editors, “Computer vision.” [Online], 2024. Available: https://www.britannica.com/technology/computer-vision. Accessed: 20 June 2024.
[27] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI’81: 7th international joint conference on Artificial intelligence, vol. 2, pp. 674–679, 1981.
[28] J.-Y. Bouguet, “Pyramidal implementation of the affine Lucas Kanade feature tracker description of the algorithm,” tech. rep., Intel corporation, 2001.
[29] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions on systems, man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[30] S. Beucher and F. Meyer, “The morphological approach to segmentation: The watershed transformation,” in Mathematical morphology in image processing, pp. 433–481, CRC Press, 2018.
[31] J. Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis and machine intelligence, no. 6, pp. 679–698, 1986.
[32] I. Pitas, Digital Image Processing Algorithms and Applications. John Wiley & Sons, 2000.
[33] S. Aditya, Y. Yang, and C. Baral, “Integrating knowledge and reasoning in image understanding,” arXiv:1906.09954, 2019.
[34] D. Sarvamangala and R. V. Kulkarni, “Convolutional neural networks in medical image understanding: A survey,” Evolutionary intelligence, vol. 15, no. 1, pp. 1–22, 2022.
[35] M. A. Al-Malla, A. Jafar, and N. Ghneim, “Image captioning model using attention and object features to mimic human image understanding,” Journal of Big Data, vol. 9, no. 1, pp. 1–16, 2022.
[36] Y.-J. Zhang and Y.-J. Zhang, “Image engineering,” in Handbook of Image Engineering, pp. 55–83, Springer, 2021.
[37] D. Ortego, J. C. SanMiguel, and J. M.
Martinez, “Long-term stationary object detection based on spatio-temporal change detection,” IEEE Signal Processing Letters, vol. 22, no. 12, pp. 2368–2372, 2015.
[38] S. Fazekas, B. K. Budai, R. Stollmayer, P. N. Kaposi, and V. Bérczi, “Artificial intelligence and neural networks in radiology–basics that all radiology residents should know,” Imaging, vol. 14, no. 2, pp. 73–81, 2022.
[39] S. Z. Li, Markov Random Field Modeling in Computer Vision. Springer Science & Business Media, 2012.
[40] Y. LeCun, C. Cortes, and C. Burges, “The MNIST database of handwritten digits.” [Online]. Available: http://yann.lecun.com/exdb/mnist/. Accessed: 20 June 2024.
[41] W. Rawat and Z. Wang, “Deep convolutional neural networks for image classification: A comprehensive review,” Neural computation, vol. 29, no. 9, pp. 2352–2449, 2017.
[42] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep, big, simple neural nets for handwritten digit recognition,” Neural computation, vol. 22, no. 12, pp. 3207–3220, 2010.
[43] L. Liu et al., “Deep learning for generic object detection: A survey,” International journal of computer vision, vol. 128, pp. 261–318, 2020.
[44] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proceedings of the IEEE, 2023.
[45] Y. Zheng, O. Andrienko, Y. Zhao, M. Park, and T. Pham, “DPPD: Deformable polar polygon object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 78–87, 2023.
[46] R. Padilla, S. L. Netto, and E. A. Da Silva, “A survey on performance metrics for object-detection algorithms,” in 2020 International conference on systems, signals and image processing (IWSSIP), pp. 237–242, IEEE, 2020.
[47] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788, 2016.
[48] R. Padilla, W. L. Passos, T. L.
Dias, S. L. Netto, and E. A. Da Silva, “A comparative analysis of object detection metrics with a companion open-source toolkit,” Electronics, vol. 10, no. 3, p. 279, 2021.
[49] Baeldung, “Intersection over union for object detection.” [Online], 2024. Available: https://www.baeldung.com/cs/object-detection-intersection-vs-union. Accessed 23 June 2024.
[50] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 7, pp. 3523–3542, 2021.
[51] S. Hao, Y. Zhou, and Y. Guo, “A brief survey on semantic segmentation with deep learning,” Neurocomputing, vol. 406, pp. 302–321, 2020.
[52] J. Wang, Y. Ma, L. Zhang, R. X. Gao, and D. Wu, “Deep learning for smart manufacturing: Methods and applications,” Journal of manufacturing systems, vol. 48, pp. 144–156, 2018.
[53] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on statistical learning in computer vision, ECCV, vol. 1, Prague, 2004.
[54] G. Csurka, C. R. Dance, F. Perronnin, and J. Willamowski, “Generic visual categorization using weak geometry,” in Toward Category-Level Object Recognition, pp. 207–224, Springer, 2006.
[55] L. Bai, Y. Li, M. Cen, and F. Hu, “3D instance segmentation and object detection framework based on the fusion of LIDAR remote sensing and optical image sensing,” Remote Sensing, vol. 13, no. 16, p. 3288, 2021.
[56] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Boston: Pearson, 2012.
[57] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes (VOC) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[58] C. H. Lampert, M. B. Blaschko, and T.
Hofmann, “Beyond sliding windows: Object localization by efficient subwindow search,” in 2008 IEEE conference on computer vision and pattern recognition, pp. 1–8, IEEE, 2008.
[59] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders, “Segmentation as selective search for object recognition,” in 2011 International conference on computer vision, pp. 1879–1886, IEEE, 2011.
[60] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, pp. 137–154, 2004.
[61] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1, pp. 886–893, IEEE, 2005.
[62] D. Mumford and J. Shah, “Boundary detection by minimizing functionals,” in IEEE Conference on computer vision and pattern recognition, vol. 17, pp. 137–154, San Francisco, 1985.
[63] J. du Buf, M. Kardan, and M. Spann, “Texture feature performance of image segmentation,” Pattern Recognition, vol. 23, pp. 291–309, 1990.
[64] A. Trémeau and N. Borel, “A region growing and merging algorithm to color segmentation,” Pattern Recognition, vol. 30, pp. 1191–1203, Jul 1997.
[65] E. Borenstein and S. Ullman, “Class-specific, top-down segmentation,” in Computer Vision—ECCV 2002: 7th European Conference on Computer Vision, Copenhagen, Denmark, May 28–31, 2002, Proceedings, Part II 7, pp. 109–122, Springer, 2002.
[66] F. Schroff, A. Criminisi, and A. Zisserman, “Single-histogram class models for image segmentation,” in Computer Vision, Graphics and Image Processing: 5th Indian Conference, ICVGIP 2006, Madurai, India, December 13-16, 2006. Proceedings, pp. 82–93, Springer, 2006.
[67] H. Yu et al., “Methods and datasets on semantic segmentation: A review,” Neurocomputing, vol. 304, pp. 82–103, 2018.
[68] J. Shotton, J. Winn, C. Rother, and A.
Criminisi, “TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9, pp. 1–15, Springer, 2006.
[69] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context,” International journal of computer vision, vol. 81, pp. 2–23, 2009.
[70] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in International Conference on Machine Learning (ICML), 2001.
[71] P. Sturgess, K. Alahari, L. Ladicky, and P. H. Torr, “Combining appearance and structure from motion features for road scene understanding,” in BMVC – British Machine Vision Conference, BMVA, 2009.
[72] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” International journal of computer vision, vol. 115, pp. 211–252, 2015.
[73] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
[74] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester, “Discriminatively trained deformable part models, release 5.” [Online], 2012. Available: http://www.rossgirshick.info/latent/. Accessed 23 June 2024.
[75] A. Plaksyvyi, M. Skublewska-Paszkowska, and P. Powroznik, “A comparative analysis of image segmentation using classical and deep learning approach,” Advances in Science and Technology. Research Journal, vol. 17, no. 6, 2023.
[76] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[77] M.
Abadi et al., “TensorFlow: A system for large-scale machine learning,” in 12th USENIX symposium on operating systems design and implementation, pp. 265–283, 2016. [78] F. Chollet, “Keras.” [Online], 2015. Available: https://keras.io. Accessed 23 June 2024. [79] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019. 134 [80] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [81] S. Pouyanfar et al., “A survey on deep learning: Algorithms, techniques, and applications,” ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–36, 2018. [82] M. Z. Alom et al., “A state-of-the-art survey on deep learning theory and architectures,” Electronics, vol. 8, no. 3, p. 292, 2019. [83] J. Howard and S. Gugger, Deep Learning for Coders with fastai and PyTorch. O’Reilly Media, 2020. [84] F. Chollet, Deep learning with Python. Simon and Schuster, 2021. [85] A. Glassner, Deep learning: A visual approach. No Starch Press, 2021. [86] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning. Cambridge University Press, 2023. [87] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organiza- tion in the brain.,” Psychological review, vol. 65, no. 6, p. 386, 1958. [88] J. A. Yacim and D. G. B. Boshoff, “Impact of artificial neural networks training algorithms on accurate prediction of property values,” Journal of Real Estate Research, vol. 40, no. 3, pp. 375–418, 2018. [89] K. Fukushima, “Visual feature extraction by a multilayered network of analog threshold ele- ments,” IEEE Transactions on Systems Science and Cybernetics, vol. 5, no. 4, pp. 322–333, 1969. [90] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010. [91] G. S. 
Bhumbra, “Deep learning improved by biological activation functions,” arXiv:1804.11237, 2018. [92] D. E. Rumelhart, G. E. Hinton, and R. Williams, “Learning internal representations by error propagation,” tech. rep., Institute for Cognitive Science, University of California, San Diego La Jolla, 1985. [93] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back- propagating errors,” Nature, vol. 323, pp. 533–536, 1986. [94] N. Nedjah, I. Santos, and L. de Macedo Mourelle, “Sentiment analysis using convolutional neural network via word embeddings,” Evolutionary Intelligence, vol. 15, no. 4, pp. 2295–2319, 2022. [95] M. Nielsen, “Neural networks and deep learning.” [Online], 2015. Available: http:// neuralnetworksanddeeplearning.com/. Accessed 23 June 2024. [96] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747, 2017. [97] S. Booth, Y. Zhou, A. Shah, and J. Shah, “Bayes-probe: Distribution-guided sampling for prediction level sets,” arXiv:2002.10248, 2020. [98] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, “Deep learning for visual understanding: A review,” Neurocomputing, vol. 187, pp. 27–48, 2016. [99] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional archi- tecture in the cat’s visual cortex,” The Journal of physiology, vol. 160, no. 1, p. 106, 1962. [100] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980. [101] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to docu- ment recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. 135 [102] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989. 
[103] NVIDIA Developer website, “Convolution.” [Online], 2024. Available: https://developer.nvidia.com/discover/convolution. Accessed 5 January 2024.
[104] M. Yani, B. Irawan, and C. Setiningsih, “Application of transfer learning using convolutional neural network method for early detection of Terry’s nail,” in Journal of Physics: Conference Series, vol. 1201, IOP Publishing, 2019.
[105] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520, 2018.
[106] A. Howard et al., “Searching for MobileNetV3,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 1314–1324, 2019.
[107] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning, pp. 6105–6114, PMLR, 2019.
[108] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
[109] L. Sifre, Rigid-motion Scattering for Image Classification. PhD thesis, École Polytechnique, 2014.
[110] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, pp. 448–456, PMLR, 2015.
[111] H. Cai, J. Lin, and S. Han, “Efficient methods for deep learning,” in Advanced Methods and Deep Learning in Computer Vision, pp. 159–190, Elsevier, 2022.
[112] M. Tan et al., “MnasNet: Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2820–2828, 2019.
[113] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018.
[114] T.-J. Yang et al., “NetAdapt: Platform-aware neural network adaptation for mobile applications,” in Proceedings of the European conference on computer vision (ECCV), pp. 285–300, 2018.
[115] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv:1710.05941, 2017.
[116] Y. Huang et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” Advances in neural information processing systems, vol. 32, 2019.
[117] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258, 2017.
[118] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv:1706.05587, 2017.
[119] C. Szegedy et al., “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
[120] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
[121] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 31, 2017.
[122] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” arXiv:1412.7062, 2014.
[123] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[124] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), pp. 801–818, 2018.
[125] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
[126] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
[127] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, “A real-time algorithm for signal analysis with the help of the wavelet transform,” in Wavelets: Time-Frequency Methods and Phase Space, Proceedings of the International Conference, Marseille, France, December 14–18, 1987, pp. 286–297, Springer, 1990.
[128] R. Mottaghi et al., “The role of context for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 891–898, 2014.
[129] X. Chen et al., “Detect what you can: Detecting and representing objects using holistic models and body parts,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1971–1978, 2014.
[130] M. Cordts et al., “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223, 2016.
[131] V. Singh, “DeepLabv3 & DeepLabv3+ the ultimate PyTorch guide.” [Online], 2022. Available: https://learnopencv.com/deeplabv3-ultimate-guide/. Accessed 7 March 2024.
[132] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
[133] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
[134] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.
[135] G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLOv8.” [Online], 2023. Available: https://github.com/ultralytics. Accessed 7 August 2024.
[136] S. Aharon et al., “Super-Gradients.” [Online], 2021. Available: https://github.com/Deci-AI/super-gradients. Accessed 29 June 2024.
[137] M. Hussain, “YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection,” Machines, vol. 11, no. 7, p. 677, 2023.
[138] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, p. 3, Atlanta, GA, 2013.
[139] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271, 2017.
[140] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv:1804.02767, 2018.
[141] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Computer Vision–ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol. 8693, pp. 740–755, Springer, 2014.
[142] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
[143] W. Liu et al., “SSD: Single shot MultiBox detector,” in Computer Vision–ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol. 9905, pp. 21–37, Springer, 2016.
[144] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “DSSD: Deconvolutional single shot detector,” arXiv:1701.06659, 2017.
[145] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” arXiv:2004.10934, 2020.
[146] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500, 2017.
[147] C.-Y. Wang et al., “CSPNet: A new backbone that can enhance learning capability of CNN,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 390–391, 2020.
[148] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125, 2017.
[149] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8759–8768, 2018.
[150] S. Yun et al., “CutMix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 6023–6032, 2019.
[151] Z. Zheng et al., “Distance-IoU loss: Faster and better learning for bounding box regression,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 12993–13000, 2020.
[152] D. Misra, “Mish: A self regularized non-monotonic activation function,” arXiv:1908.08681, 2019.
[153] G. Jocher, “YOLOv5 by Ultralytics.” [Online], 2020. Available: https://github.com/ultralytics/yolov5. Accessed 7 August 2024.
[154] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv:1606.08415, 2016.
[155] A. Buslaev et al., “Albumentations: Fast and flexible image augmentations,” Information, vol. 11, no. 2, p. 125, 2020.
[156] J. Terven, D.-M. Córdova-Esparza, and J.-A. Romero-González, “A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS,” Machine Learning and Knowledge Extraction, vol. 5, no. 4, pp. 1680–1716, 2023.
[157] Midjourney research lab, “Midjourney.” [Online], 2022. Available: https://www.midjourney.com. Accessed 13 March 2024.
[158] A. Ramesh et al., “Zero-shot text-to-image generation,” in International Conference on Machine Learning, pp. 8821–8831, PMLR, 2021.
[159] E. Denton, A. Hanna, R. Amironesei, A. Smart, and H. Nicole, “On the genealogy of machine learning datasets: A critical history of ImageNet,” Big Data & Society, vol. 8, no. 2, pp. 1–14, 2021.
[160] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
[161] S. Priya and R. A. Uthra, “Deep learning framework for handling concept drift and class imbalanced complex decision-making on streaming data,” Complex & Intelligent Systems, vol. 9, no. 4, pp. 3499–3515, 2023.
[162] J. M. Johnson and T. M. Khoshgoftaar, “Survey on deep learning with class imbalance,” Journal of Big Data, vol. 6, no. 1, pp. 1–54, 2019.
[163] F. Zhuang et al., “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[164] S. Niu, Y. Liu, J. Wang, and H. Song, “A decade survey of transfer learning (2010–2020),” IEEE Transactions on Artificial Intelligence, vol. 1, no. 2, pp. 151–166, 2020.
[165] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big data, vol. 3, pp. 1–40, 2016.
[166] F. Chollet, “Complete guide to transfer learning & fine-tuning in Keras.” [Online], 2020. Available: https://keras.io/guides/transfer_learning/. Accessed 1 May 2024.
[167] B.-X. Wu, C.-G. Yang, and J.-P. Zhong, “Research on transfer learning of vision-based gesture recognition,” International Journal of Automation and Computing, vol. 18, no. 3, pp. 422–431, 2021.
[168] S. Chiba and H. Sasaoka, “Basic study for transfer learning for autonomous driving in car race of model car,” in 2021 6th International Conference on Business and Industrial Research (ICBIR), pp. 138–141, IEEE, 2021.
[169] J. Hua, L. Zeng, G. Li, and Z. Ju, “Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning,” Sensors, vol. 21, no. 4, p. 1278, 2021.
[170] H. E. Kim, A. Cosa-Linan, N. Santhanam, M. Jannesari, M. E. Maros, and T. Ganslandt, “Transfer learning for medical image classification: A literature review,” BMC Medical Imaging, vol. 22, no. 1, p. 69, 2022.
[171] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.
[172] L. Alzubaidi et al., “Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions,” Journal of Big Data, vol. 8, pp. 1–74, 2021.
[173] M. Ivanovs, R. Kadikis, M. Lulla, A. Rutkovskis, and A. Elsts, “Automated quality assessment of hand washing using deep learning,” arXiv:2011.11383, 2020.
[174] M. Lulla et al., “Hand-washing video dataset annotated according to the World Health Organization’s hand-washing guidelines,” Data, vol. 6, no. 4, p. 38, 2021.
[175] O. Zemlanuhina et al., “Influence of different types of real-time feedback on hand washing quality assessed with neural networks/simulated neural networks,” in SHS Web of Conferences, vol. 131, pp. 1–13, EDP Sciences, 2022.
[176] A. Elsts, M. Ivanovs, R. Kadikis, and O. Sabelnikovs, “CNN for hand washing movement classification: What matters more–the approach or the dataset?,” in 2022 Eleventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–6, IEEE, 2022.
[177] S. L. Barnes, D. J. Morgan, A. D. Harris, P. C. Carling, and K. A. Thom, “Preventing the transmission of multidrug-resistant organisms: Modeling the relative importance of hand hygiene and environmental cleaning interventions,” Infection Control & Hospital Epidemiology, vol. 35, no. 9, pp. 1156–1162, 2014.
[178] European Centre for Disease Prevention and Control, “Assessing the health burden of in- fections with antibiotic-resistant bacteria in the EU/EEA, 2016–2020,” tech. rep., ECDC, Stockholm, 2022. [179] C. J. Murray et al., “Global burden of bacterial antimicrobial resistance in 2019: A systematic analysis,” The Lancet, vol. 399, no. 10325, pp. 629–655, 2022. 139 [180] B. Allegranzi and D. Pittet, “Role of hand hygiene in healthcare-associated infection preven- tion,” Journal of Hospital Infection, vol. 73, no. 4, pp. 305–315, 2009. [181] C. Suetens et al., “Prevalence of healthcare-associated infections, estimated incidence and composite antimicrobial resistance index in acute care hospitals and long-term care facilities: Results from two European point prevalence surveys, 2016 to 2017,” Eurosurveillance, vol. 23, no. 46, pp. 1–17, 2018. [182] E. Goldman, “Exaggerated risk of transmission of COVID-19 by fomites,” The Lancet Infec- tious Diseases, vol. 20, no. 8, pp. 892–893, 2020. [183] K. J. Mckay, R. Z. Shaban, and P. Ferguson, “Hand hygiene compliance monitoring: Do video- based technologies offer opportunities for the future?,” Infection, Disease & Health, vol. 25, no. 2, pp. 92–100, 2020. [184] World Health Organization, WHO guidelines on hand hygiene in health care. World Health Organization, 2009. [185] N. Luangasanatip et al., “Comparative efficacy of interventions to promote hand hygiene in hospital: Systematic review and network meta-analysis,” BMJ, vol. 351, 2015. [186] D. J. Gould, D. Moralejo, N. Drey, J. H. Chudleigh, and M. Taljaard, “Interventions to improve hand hygiene compliance in patient care,” Cochrane database of systematic reviews, no. 9, 2017. [187] N. Masroor, M. Doll, M. Stevens, and G. Bearman, “Approaches to hand hygiene monitoring: From low to high technology approaches,” International Journal of Infectious Diseases, vol. 65, pp. 101–104, 2017. [188] B. C. Knepper, A. M. Miller, and H. L. 
Young, “Impact of an automated hand hygiene mon- itoring system combined with a performance improvement intervention on hospital-acquired infections,” Infection Control & Hospital Epidemiology, vol. 41, no. 8, pp. 931–937, 2020. [189] The Joint Commission, “Measuring hand hygiene adherence: Overcoming the challenges,” 2009. [190] M. Willmott et al., “Effectiveness of hand hygiene interventions in reducing illness absence among children in educational settings: A systematic review and meta-analysis,” Archives of Disease in Childhood, vol. 101, no. 1, pp. 42–50, 2016. [191] S. J. S. Aghdassi et al., “A multimodal intervention to improve hand hygiene compliance in peripheral wards of a tertiary care university centre: A cluster randomised controlled trial,” Antimicrobial Resistance & Infection Control, vol. 9, no. 1, pp. 1–9, 2020. [192] M. Biswal et al., “Evaluation of the short-term and long-term effect of a short series of hand hygiene campaigns on improving adherence in a tertiary care hospital in India,” American Journal of Infection Control, vol. 42, no. 9, pp. 1009–1010, 2014. [193] Y. Suzuki, M. Morino, I. Morita, and S. Yamamoto, “The effect of a 5-year hand hygiene initiative based on the who multimodal hand hygiene improvement strategy: An interrupted time-series study,” Antimicrobial Resistance & Infection Control, vol. 9, pp. 1–12, 2020. [194] R. T. Ellison III, C. M. Barysauskas, E. A. Rundensteiner, D. Wang, and B. Barton, “A prospective controlled trial of an electronic hand hygiene reminder system,” in Open Forum Infectious Diseases, vol. 2, p. ofv121, Oxford University Press, 2015. [195] M. McGuckin, R. Waterman, and J. Govednik, “Hand hygiene compliance rates in the united states—a one-year multicenter collaboration using product/volume usage measurement and feedback,” American Journal of Medical Quality, vol. 24, no. 3, pp. 205–213, 2009. [196] J. M. 
Boyce et al., “Impact of an automated hand hygiene monitoring system and addi- tional promotional activities on hand hygiene performance rates and healthcare-associated infections,” Infection Control & Hospital Epidemiology, vol. 40, no. 7, pp. 741–747, 2019. [197] J. A. Srigley, C. D. Furness, G. R. Baker, and M. Gardam, “Quantification of the hawthorne effect in hand hygiene compliance monitoring using an electronic monitoring system: A ret- rospective cohort study,” BMJ Quality & Safety, vol. 23, no. 12, pp. 974–980, 2014. 140 [198] M. Oudah, A. Al-Naji, and J. Chahl, “Hand gesture recognition based on computer vision: A review of techniques,” Journal of Imaging, vol. 6, no. 8:73, 2020. [199] “Sample: Kaggle Hand Wash Dataset.” [Online], 2019. Available: https://www.kaggle. com/realtimear/hand-wash-dataset. Accessed 18 February 2024. [200] M. A. Ward et al., “Automated and electronically assisted hand hygiene monitoring systems: A systematic review,” American Journal of Infection Control, vol. 42, no. 5, pp. 472–478, 2014. [201] C. Wang et al., “Electronic monitoring systems for hand hygiene: Systematic review of tech- nology,” Journal of Medical Internet Research, vol. 23, no. 11, p. e27880, 2021. [202] J. Srigley et al., “Hand hygiene monitoring technology: A systematic review of efficacy,” Journal of Hospital Infection, vol. 89, no. 1, pp. 51–60, 2015. [203] C. Wang et al., “Accurate measurement of handwash quality using sensor armbands: Instru- ment validation study,” JMIR MHealth Uhealth, vol. 8, no. 3, p. e17001, 2020. [204] V. Galluzzi, Automatic recognition of healthcare worker hand hygiene. PhD thesis, University of Iowa, 2015. [205] V. Galluzzi, T. Herman, and P. Polgreen, “Hand hygiene duration and technique recognition using wrist-worn sensors,” in Proceedings of the 14th International Conference on Information Processing in Sensor Networks, pp. 106–117, 2015. [206] J. Hoey, A. Von Bertoldi, P. Poupart, and A. 
Mihailidis, “Assisting persons with dementia during handwashing using a partially observable Markov decision process.,” in Proceedings of the 5th International Conference on Computer Vision Systems (ICVS 2007), 2007. [207] D. F. Llorca, I. Parra, M. A´. Sotelo, and G. Lacey, “A vision-based system for automatic hand washing quality assessment,” Machine Vision and Applications, vol. 22, no. 2, pp. 219–234, 2011. [208] S. Yeung et al., “Vision-based hand hygiene monitoring in hospitals,” AMIA, 2016. [209] G. Li et al., “Hand gesture recognition based on convolution neural network,” Cluster Com- puting, vol. 22, no. 2, pp. 2719–2729, 2019. [210] E. Prakasa and B. Sugiarto, “Video analysis on handwashing movement for the completeness evaluation,” in 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), pp. 296–301, IEEE, 2020. [211] A. Nagaraj, M. Sood, C. Sureka, and G. Srinivasa, “Real-time action recognition for fine- grained actions and the hand wash dataset,” arXiv:2210.07400, 2022. [212] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1933–1941, 2016. [213] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” arXiv:1406.2199, 2014. [214] K. Cikel, M. Arzamendia, D. Gregor, D. Gutie´rrez, and S. Toral, “Evaluation of a CNN+LSTM system for the classification of hand-washing steps,” in XIX Conference of the Spanish Association for Artificial Intelligence (CAEPIA), 2021. [215] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. [216] K. Yamamoto, M. Yoshii, F. Kinoshita, and H. 
Touyama, “Classification vs regression by CNN for handwashing skills evaluations in nursing education,” in 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 590–593, IEEE, 2020. [217] W. Kay et al., “The Kinetics human action video dataset,” arXiv:1705.06950, 2017. 141 [218] Y. Yoshikawa, J. Lin, and A. Takeuchi, “STAIR actions: A video dataset of everyday home actions,” arXiv:1804.04326, 2018. [219] W. E. Trick et al., “Impact of ring wearing on hand contamination and comparison of hand hygiene agents in a hospital,” Clinical infectious diseases, vol. 36, no. 11, pp. 1383–1390, 2003. [220] A. Hautemaniere et al., “Factors determining poor practice in alcoholic gel hand rub technique in hospital workers,” Journal of infection and Public Health, vol. 3, no. 1, pp. 25–34, 2010. [221] World Health Organization, WHO guidelines on hand hygiene in health care. 2009. [222] K. Cho, B. Van Merrie¨nboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv:1409.1259, 2014. [223] M. Ivanovs, K. Ozols, A. Dobrajs, and R. Kadikis, “Improving semantic segmentation of urban scenes for self-driving cars with synthetic images,” Sensors, vol. 22, no. 6, p. 2252, 2022. [224] A. Jiwani, S. Ganguly, C. Ding, N. Zhou, and D. M. Chan, “A semantic segmentation network for urban-scale building footprint extraction using RGB satellite imagery,” arXiv:2104.01263, 2021. [225] R. E. Huerta et al., “Mapping urban green spaces at the metropolitan level using very high resolution satellite imagery and deep learning techniques for semantic segmentation,” Remote Sensing, vol. 13, no. 11, p. 2031, 2021. [226] K. Yuan, X. Zhuang, G. Schaefer, J. Feng, L. Guan, and H. Fang, “Deep-learning-based mul- tispectral satellite image segmentation for water body detection,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 7422–7434, 2021. [227] M. 
Wurm, T. Stark, X. X. Zhu, M. Weigand, and H. Taubenbo¨ck, “Semantic segmentation of slums in satellite images using transfer learning on fully convolutional neural networks,” ISPRS journal of photogrammetry and remote sensing, vol. 150, pp. 59–69, 2019. [228] S. Thrun, “Toward robotic cars,” Communications of the ACM, vol. 53, no. 4, pp. 99–106, 2010. [229] T. Litman, “Autonomous vehicle implementation predictions,” tech. rep., Victoria Transport Policy Institute Victoria, Canada, 2017. [230] J. Van Brummelen, M. O’Brien, D. Gruyer, and H. Najjaran, “Autonomous vehicle percep- tion: The technology of today and tomorrow,” Transportation research part C: Emerging technologies, vol. 89, pp. 384–406, 2018. [231] F. Arena, G. Pau, and M. Collotta, “A survey on driverless vehicles: From their diffusion to security,” Journal of Internet Services and Information Security (JISIS), vol. 8, pp. 1–19, 2018. [232] C. Badue et al., “Self-driving cars: A survey,” Expert Systems with Applications, vol. 165:113816, 2020. [233] G. Biggi and J. Stilgoe, “Artificial intelligence in self-driving cars research and innovation: A scientometric and bibliometric analysis.” [Online], 2021. Available: https://ssrn.com/ abstract=3829897. Accessed 29 June 2024. [234] SAE International, “Taxonomy and definitions for terms related to driving automation sys- tems for on-road motor vehicles.” [Online], 2018. Version J3016 201806. Available: https: //www.sae.org/standards/content/j3016_201806/. Accessed 29 June 2024. [235] B. Paden, M. Cˇa´p, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on intelligent vehicles, vol. 1, no. 1, pp. 33–55, 2016. [236] M. Treml et al., “Speeding up semantic segmentation for autonomous driving,” in Neural Information Processing Systems (NIPS) Workshop ‘Machine Learning for Intelligent Trans- portation Systems’ (MLITS), 2016. 142 [237] Q. Sellat, S. K. Bisoy, and R. 
Priyadarshini, “Semantic segmentation for self-driving cars using deep learning: A survey,” in Cognitive Big Data Intelligence with a Metaheuristic Approach, pp. 211–238, Elsevier, 2022. [238] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high- definition ground truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009. [239] T. Scharwa¨chter, M. Enzweiler, U. Franke, and S. Roth, “Efficient multi-cue scene segmenta- tion,” in German Conference on Pattern Recognition, pp. 435–445, Springer, 2013. [240] H. Abu Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, and C. Rother, “Augmented re- ality meets computer vision: Efficient data generation for urban driving scenes,” International Journal of Computer Vision, vol. 126, pp. 961–972, 2018. [241] J. Xie, M. Kiefel, M.-T. Sun, and A. Geiger, “Semantic instance annotation of street scenes by 3D to 2D label transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3688–3697, 2016. [242] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3234–3243, 2016. [243] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in European conference on computer vision, pp. 102–118, Springer, 2016. [244] M. Hahner, D. Dai, C. Sakaridis, J.-N. Zaech, and L. Van Gool, “Semantic understanding of foggy scenes with purely synthetic data,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 3675–3681, IEEE, 2019. [245] B. Wymann et al., “TORCS: The open racing car simulator.” [Online], 2015. Available: https://www.cse.chalmers.se/~chrdimi/papers/torcs.pdf. Accessed 29 June 2024. [246] L. Berlincioni, F. Becattini, L. Galteri, L. Seidenari, and A. 
Del Bimbo, “Road layout un- derstanding by generative adversarial inpainting,” in Inpainting and Denoising Challenges, pp. 111–128, Springer, 2019. [247] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proceedings of the 27th annual conference on computer graphics and interactive techniques, pp. 417–424, 2000. [248] G. Liu et al., “Image inpainting for irregular holes using partial convolutions,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100, 2018. [249] F. S. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez, “Effective use of synthetic data for urban scene semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 84–100, 2018. [250] A. Cordeiro, L. F. Rocha, C. Costa, P. Costa, and M. F. Silva, “Bin picking approaches based on deep learning techniques: A state-of-the-art survey,” in 2022 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), pp. 110–117, IEEE, 2022. [251] D. Duplevska, M. Ivanovs, J. Arents, and R. Kadikis, “Sim2real image translation to improve a synthetic dataset for a bin picking task,” in 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), pp. 1–7, IEEE, 2022. [252] A. Dzedzickis, J. Subacˇiu¯te˙-Zˇemaitiene˙, E. Sˇutinys, U. Samukaite˙-Bubniene˙, and V. Bucˇinskas, “Advanced applications of industrial robotics: New trends and possibilities,” Applied Sciences, vol. 12, no. 1:135, 2021. [253] J. Arents and M. Greitans, “Smart industrial robot control trends, challenges and opportu- nities within manufacturing,” Applied Sciences, vol. 12, no. 2:937, 2022. 143 [254] The MathWorks, Inc., “Gazebo simulation of semi-structured intelligent bin picking for UR5e using YOLO and PCA-based object detection.” [On- line]. Available: https://www.mathworks.com/help/robotics/urseries/ug/ gazebo-simulation-ur5e-semistructured-intelligent-bin-picking-example.html. 
Accessed 27 February 2024.
[255] Q. Bai, S. Li, J. Yang, Q. Song, Z. Li, and X. Zhang, “Object detection recognition and robot grasping based on machine learning: A survey,” IEEE Access, vol. 8, pp. 181855–181879, 2020.
[256] H. Choi et al., “On the use of simulation in robotics: Opportunities, challenges, and suggestions for moving forward,” Proceedings of the National Academy of Sciences, vol. 118, no. 1, p. e1907856118, 2021.
[257] J. Anderson, Methods and Applications of Synthetic Data Generation. PhD thesis, Clemson University, 2021.
[258] N. Jakobi, P. Husbands, and I. Harvey, “Noise and the reality gap: The use of simulation in evolutionary robotics,” in Advances in Artificial Life: Third European Conference on Artificial Life, pp. 704–720, Springer, 1995.
[259] E. Buls, R. Kadikis, R. Cacurs, and J. Arents, “Generation of synthetic training data for object detection in piles,” in Eleventh International Conference on Machine Vision (ICMV 2018), vol. 11041, pp. 533–540, SPIE, 2019.
[260] J. Arents et al., “Synthetic data of randomly piled, similar objects for deep learning-based object detection,” in International Conference on Image Analysis and Processing, pp. 706–717, Springer, 2022.
[261] V. Feščenko, J. Arents, and R. Kadikis, “Synthetic data generation for visual detection of flattened PET bottles,” Machine Learning and Knowledge Extraction, vol. 5, no. 1, pp. 14–28, 2023.
[262] N. Sünderhauf et al., “The limits and potentials of deep learning for robotics,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 405–420, 2018.
[263] M. Carranza-García, J. Torres-Mateo, P. Lara-Benítez, and J. García-Gutiérrez, “On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data,” Remote Sensing, vol. 13, no. 1:89, 2020.
[264] L. Tian, N. M. Thalmann, D. Thalmann, Z. Fang, and J.
Zheng, “Object grasping of humanoid robot based on YOLO,” in Advances in Computer Graphics: 36th Computer Graphics International Conference, pp. 476–482, Springer, 2019.
[265] Z. Cao, T. Liao, W. Song, Z. Chen, and C. Li, “Detecting the shuttlecock for a badminton robot: A YOLO based approach,” Expert Systems with Applications, vol. 164:113833, 2021.
[266] G. Zhaoxin, L. Han, Z. Zhijiang, and P. Libo, “Design a robot system for tomato picking based on YOLO v5,” IFAC-PapersOnLine, vol. 55, no. 3, pp. 166–171, 2022.
[267] A. S. Olesen, B. B. Gergaly, E. A. Ryberg, M. R. Thomsen, and D. Chrysostomou, “A collaborative robot cell for random bin-picking based on deep learning policies and a multi-gripper switching strategy,” Procedia Manufacturing, vol. 51, pp. 3–10, 2020.
[268] S. Lee and Y. Lee, “Real-time industrial bin-picking with a hybrid deep learning-engineering approach,” in 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 584–588, IEEE, 2020.
[269] P. Torres, J. Arents, H. Marques, and P. Marques, “Bin-picking solution for randomly placed automotive connectors based on machine learning techniques,” Electronics, vol. 11, no. 3:476, 2022.
[270] J. Tremblay et al., “Training deep networks with synthetic data: Bridging the reality gap by domain randomization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 969–977, 2018.
[271] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, 2016.
[272] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3722–3731, 2017.
[273] A.
Shrivastava et al., “Learning from simulated and unsupervised images through adversarial training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2107–2116, 2017.
[274] K. Bousmalis et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250, IEEE, 2018.
[275] K. Rao et al., “RL-CycleGAN: Reinforcement learning aware simulation-to-real,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11157–11166, 2020.
[276] D. Ho et al., “RetinaGAN: An object-aware approach to sim-to-real transfer,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 10920–10926, IEEE, 2021.
[277] Blender Foundation, “Blender 3D computer graphics software.” [Online], 2022. Available: https://www.blender.org/. Accessed 19 August 2024.
[278] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
[279] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, pp. 234–241, Springer, 2015.
[280] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, vol. 1, no. 10, 2016.
[281] W. Shi et al., “Is the deconvolution layer the same as a convolutional layer?,” arXiv:1609.07009, 2016.
[282] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[283] K. Kleeberger, R. Bormann, W. Kraus, and M. F.
Huber, “A survey on learning-based robotic grasping,” Current Robotics Reports, vol. 1, pp. 239–249, 2020.
[284] C. M. Leung et al., “A guide to the organ-on-a-chip,” Nature Reviews Methods Primers, vol. 2, no. 1, p. 33, 2022.
[285] M. Ivanovs et al., “Synthetic image generation with a fine-tuned latent diffusion model for organ on chip cell image classification,” in Proceedings of 2023 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pp. 148–153, IEEE, 2023.
[286] V. Movčana et al., “Organ-on-a-chip (OOC) image dataset for machine learning and tissue model evaluation,” Data, vol. 9, no. 2, p. 28, 2024.
[287] S. A. Ajagbe, K. A. Amuda, M. A. Oladipupo, F. A. Oluwaseyi, and K. I. Okesola, “Multi-classification of Alzheimer disease on magnetic resonance images (MRI) using deep convolutional neural network (DCNN) approaches,” International Journal of Advanced Computer Research, vol. 11, no. 53, p. 51, 2021.
[288] A. Iqbal, M. Sharif, M. A. Khan, W. Nisar, and M. Alhaisoni, “FF-UNet: A U-shaped deep convolutional neural network for multimodal biomedical image segmentation,” Cognitive Computation, vol. 14, no. 4, pp. 1287–1302, 2022.
[289] Villa-Pulgarin et al., “Optimized convolutional neural network models for skin lesion classification,” Computers, Materials & Continua, vol. 70, no. 2, 2022.
[290] S. Sharma et al., “Performance evaluation of the deep learning based convolutional neural network approach for the recognition of chest X-ray images,” Frontiers in Oncology, vol. 12, 2022.
[291] Rodriguez-Ruiz et al., “Stand-alone artificial intelligence for breast cancer detection in mammography: Comparison with 101 radiologists,” JNCI: Journal of the National Cancer Institute, vol. 111, no. 9, pp. 916–922, 2019.
[292] H. Ali, S. Murad, and Z. Shah, “Spot the fake lungs: Generating synthetic medical images using neural diffusion models,” in Irish Conference on Artificial Intelligence and Cognitive Science, pp.
32–39, Springer, 2022.
[293] E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” arXiv:2106.09685, 2021.
[294] S. R. Richter, Z. Hayder, and V. Koltun, “Playing for benchmarks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2213–2222, 2017.
[295] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv:1312.6114, 2013.
[296] A. Volokitin et al., “Modelling the distribution of 3D brain MRI using a 2D slice VAE,” in Proceedings of the 23rd International Conference on Medical Image Computing and Computer-Assisted Intervention – MICCAI 2020, pp. 657–666, Springer, 2020.
[297] D. E. Diamantis, P. Gatoula, and D. K. Iakovidis, “EndoVAE: Generating endoscopic images with a variational autoencoder,” in 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), pp. 1–5, IEEE, 2022.
[298] Y. Kim, S. Wiseman, A. Miller, D. Sontag, and A. Rush, “Semi-amortized variational autoencoders,” in International Conference on Machine Learning, pp. 2678–2687, PMLR, 2018.
[299] J. Tomczak and M. Welling, “VAE with a VampPrior,” in International Conference on Artificial Intelligence and Statistics, pp. 1214–1223, PMLR, 2018.
[300] L. Ma, R. Shuai, X. Ran, W. Liu, and C. Ye, “Combining DC-GAN with ResNet for blood cell image classification,” Medical & Biological Engineering & Computing, vol. 58, pp. 1251–1264, 2020.
[301] T. Iqbal and H. Ali, “Generative adversarial network for medical images (MI-GAN),” Journal of Medical Systems, vol. 42, pp. 1–11, 2018.
[302] H. Chen, “Challenges and corresponding solutions of generative adversarial networks (GANs): A survey study,” in Journal of Physics: Conference Series, vol. 1827, p. 012066, IOP Publishing, 2021.
[303] M. Akrout et al., “Diffusion-based data augmentation for skin disease classification: Impact across original medical datasets to fully synthetic images,” arXiv:2301.04802, 2023.
[304] P. Chambon, C. Bluethgen, C. P. Langlotz, and A.
Chaudhari, “Adapting pretrained vision-language foundational models to medical imaging domains,” arXiv:2210.04133, 2022.
[305] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International Conference on Machine Learning, pp. 2256–2265, PMLR, 2015.
[306] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10850–10869, 2023.
[307] X. Li, M. Sakevych, G. Atkinson, and V. Metsis, “BioDiffusion: A versatile diffusion model for biomedical signal synthesis,” Bioengineering, vol. 11, no. 4, p. 299, 2024.
[308] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[309] P. Esser et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first International Conference on Machine Learning, 2024.
[310] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
[311] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[312] R. Rombach et al., “Stable Diffusion.” [Online], 2022. Available: https://github.com/CompVis/stable-diffusion. Accessed 6 August 2024.
[313] R. Rombach, A. Blattmann, K. Crowson, A. Khaliq, and P. Esser, “Latent diffusion models.” [Online], 2022. Available: https://github.com/CompVis/latent-diffusion. Accessed 6 August 2024.
[314] A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, pp. 8748–8763, PMLR, 2021.
[315] “Stable Diffusion web UI.” [Online], 2022. Available: https://github.com/AUTOMATIC1111/stable-diffusion-webui. Accessed 6 August 2024.
[316] C.
Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
[317] I. Stenbit, F. Chollet, and L. Wood, “A walk through latent space with Stable Diffusion.” [Online], 2022. Available: https://keras.io/examples/generative/random_walks_with_stable_diffusion/. Accessed 7 August 2024.