UNFACE: Fine-grained facial analysis for unmasking hidden information
The human face is a fundamental source of information for understanding the behavior of individuals. Traditionally, this has been exploited in computer vision for the recognition of identity and expressions, but it has recently been suggested that the information that can be extracted from the face goes well beyond this and can be indicative of deception, heart rate, psychological states, or even psychiatric disorders such as autism or depression. Some of this information, however, might not be apparent or might even be hidden from us, and can only be recovered by means of specialized techniques. An iconic example is the detection of cardiac heart rate by amplifying the subtle color changes of the face due to blood flow, which are invisible to the human eye.
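As a toy illustration of this principle (a minimal sketch, not the project's implementation), the following recovers a pulse rate from a simulated per-frame facial color signal: the mean green value of a face region is band-pass filtered to the plausible cardiac range, and the spectral peak is read out as beats per minute. The frame rate, filter band and synthetic signal are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fps = 30.0                                  # assumed camera frame rate
t = np.arange(0, 20, 1 / fps)               # 20 s of video
# Stand-in for the mean green value of the face region per frame:
# a faint 1.2 Hz (72 bpm) pulse buried in noise.
trace = 0.05 * np.sin(2 * np.pi * 1.2 * t) + np.random.randn(t.size)

# Band-pass 0.7-4 Hz (42-240 bpm), the physiologically plausible range.
b, a = butter(3, [0.7, 4.0], btype="band", fs=fps)
filtered = filtfilt(b, a, trace)

# The dominant frequency of the filtered trace estimates the heart rate.
spectrum = np.abs(np.fft.rfft(filtered))
freqs = np.fft.rfftfreq(filtered.size, d=1 / fps)
print(f"Estimated heart rate: {60 * freqs[np.argmax(spectrum)]:.0f} bpm")
```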
The goal of the UNFACE project has been to address fine-grained facial analysis to unmask different sources of information hidden in the face. The project has delivered research results both in fundamental facial analysis algorithms (e.g. landmark localization and tracking, facial expression analysis, head pose estimation, and facial surface reconstruction) and in a few selected application areas that demonstrate the practical relevance of the developed methods (e.g. affective computing, automatic lip-reading, dysmorphology analysis and deception detection).

Among the achievements of the UNFACE project, we highlight:
1. The design of advanced deep learning architectures for accurate and robust tracking of facial landmarks under realistic (in-the-wild) scenarios, for which the resulting models have been made publicly available.
2. The use of spectral decomposition methods to improve the accuracy of facial expression analysis in 3D, as well as to improve dense surface correspondences for 3D facial reconstruction.
3. The creation of the first 3D baby face model, built exclusively from infant facial surfaces within an innovative pipeline based on the aforementioned spectral correspondences, which can also automatically derive the model template instead of requiring a pre-existing one, as other state-of-the-art methods commonly do.
4. The development of a database for lie detection based on a competitive game scenario that promotes the frequent and motivated use of lies by the participants, recorded with multiple cameras that provide both 2D and 3D information of the participants' faces.
5. The development of data-driven representations that make continuous lip-reading possible in Spanish, and potentially in other languages, without the need to replicate the huge data resources needed to train other state-of-the-art lip-reading systems, which in practice constrains their applicability to English.

Principal Investigators: Xavier Binefa & Federico Sukno

This project was funded by the 2017 call of the "Programa Estatal de Fomento de la Investigación Científica y Técnica de Excelencia" from the Spanish Ministry of Economy, Industry and Competitiveness.

Index of Project Results:
1. Composite recurrent network with internal denoising for facial alignment in still and video images in the wild, Image and Vision Computing, 2021.
2. Survey on 3D face reconstruction from uncalibrated images, Computer Science Review, 2021.
3. 3D Fetal Face Reconstruction from Ultrasound Imaging, GRAPP 2021.
4. An Enhanced Adversarial Network with Combined Latent Features for Spatio-Temporal Facial Affect Estimation in the Wild, VISAPP 2021.
5. Spectral Correspondence Framework for Building a 3D Baby Face Model, FG 2020.
6. End-to-end facial and physiological model for Affective Computing and applications, FG 2020.
7. Refining the resolution of craniofacial dysmorphology in bipolar disorder as an index of brain dysmorphogenesis, Psychiatry Research, 291: 113243, 2020.
8. CoGANs for Unsupervised Visual Speech Adaptation to New Speakers, ICASSP 2020.
9. Tensor Decomposition and Non-linear Manifold Modeling for 3D Head Pose Estimation, International Journal of Computer Vision, 127(10): 1565–1585, 2019.
10. Three-Dimensional Face Reconstruction from Uncalibrated Photographs: Application to Early Detection of Genetic Syndromes, MICCAI CLIP 2019.
11. Robust facial alignment with internal denoising auto-encoder, CRV 2019.
12. Lip-Reading with Limited-Data Network, EUSIPCO 2019.
13. Fully end-to-end composite recurrent convolution network for deformable facial tracking in the wild, FG 2019.
14. Heatmap-guided balanced deep convolution networks for family classification in the wild, FG 2019.
15. Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish, Communications in Computer and Information Science, 2019.
16. Multi-instance dynamic ordinal random fields for weakly supervised facial behavior analysis, IEEE Transactions on Image Processing, 27(8): 3969–3982, 2018.
17. Automatic local shape spectrum analysis for 3D facial expression recognition, Image and Vision Computing, 79: 86–98, 2018.
18. 3D head pose estimation using tensor decomposition and non-linear manifold modeling, 3DV 2018.
19. Survey on Automatic Lip-Reading in the Era of Deep Learning, Image and Vision Computing, 78: 53–72, 2018.
20. A quantitative comparison of methods for 3D face reconstruction from 2D images, FG 2018.

Further activities related to the project:
· 4 PhD Theses
· 12 Final bachelor/master projects

Composite recurrent network with internal denoising for facial alignment in still and video images in the wild
D. Aspandi, O. Martinez, F.M. Sukno and X. Binefa
Image and Vision Computing, 111(7): 104189, 2021.
Facial alignment is an essential task for many higher-level facial analysis applications, such as animation, human activity recognition and human-computer interaction. Although the recent availability of big datasets and powerful deep-learning approaches has enabled major improvements in state-of-the-art accuracy, the performance of current approaches can severely deteriorate when dealing with images in highly unconstrained conditions, which limits the real-life applicability of such models. In this paper, we propose a composite recurrent tracker with internal denoising that jointly addresses both single-image facial alignment and deformable facial tracking in the wild. Specifically, we incorporate multilayer LSTMs to model temporal dependencies of variable length and introduce an internal denoiser that selectively enhances the input images to improve the robustness of our overall model. We achieve this by combining four different sub-networks that specialize in each of the key tasks that are required, namely face detection, bounding-box tracking, facial region validation and facial alignment with internal denoising. These blocks are endowed with novel algorithms, resulting in a facial tracker that is accurate, robust to in-the-wild settings and resilient against drifting. We demonstrate this by testing our model on the 300-W and Menpo datasets for single-image facial alignment, and on the 300-VW dataset for deformable facial tracking. Comparison against 20 other state-of-the-art methods demonstrates the excellent performance of the proposed approach.
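The following is a minimal, illustrative sketch of the architecture pattern described above (an internal denoiser feeding a CNN encoder and a multilayer LSTM that regresses landmarks per frame); it is not the paper's exact network, and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class DenoisingTracker(nn.Module):
    def __init__(self, n_landmarks=68, hidden=256):
        super().__init__()
        # Internal denoiser: a small convolutional auto-encoder.
        self.denoiser = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )
        # Frame encoder: CNN features pooled to a fixed-size vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Multilayer LSTM handles variable-length temporal context.
        self.lstm = nn.LSTM(64, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2 * n_landmarks)  # (x, y) per landmark

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)               # (B*T, 3, H, W)
        x = self.encoder(self.denoiser(x))     # (B*T, 64)
        x, _ = self.lstm(x.view(b, t, -1))     # (B, T, hidden)
        return self.head(x).view(b, t, -1, 2)  # (B, T, 68, 2)

coords = DenoisingTracker()(torch.randn(2, 5, 3, 128, 128))
print(coords.shape)  # torch.Size([2, 5, 68, 2])
```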

Survey on 3D face reconstruction from uncalibrated images
A. Morales, G. Piella and F.M. Sukno
Computer Science Review, 40(5): 100400, 2021.
Recently, a lot of attention has been focused on the incorporation of 3D data into face analysis and its applications. Despite providing a more accurate representation of the face, 3D facial images are more complex to acquire than 2D pictures. As a consequence, great effort has been invested in developing systems that reconstruct 3D faces from an uncalibrated 2D image. However, the 3D-from-2D face reconstruction problem is ill-posed, so prior knowledge is needed to restrict the solution space. In this work, we review 3D face reconstruction methods proposed in the last decade, focusing on those that only use 2D pictures captured under uncontrolled conditions. We present a classification of the proposed methods based on the technique used to add prior knowledge, considering three main strategies, namely statistical model fitting, photometry, and deep learning, and reviewing each of them separately. In addition, given the relevance of statistical 3D facial models as prior knowledge, we explain the construction procedure and provide a list of the most popular publicly available 3D facial models. After this exhaustive study of 3D-from-2D face reconstruction approaches, we observe that the deep learning strategy has been growing rapidly over the last few years, becoming the standard choice in replacement of the once widespread statistical model fitting. Unlike the other two strategies, photometry-based methods have decreased in number due to the need for strong underlying assumptions that limit the quality of their reconstructions compared to statistical model fitting and deep learning methods. The review also identifies current challenges and suggests avenues for future research.
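As a minimal illustration of the statistical model fitting strategy discussed in the survey, the sketch below fits a toy morphable model (mean shape plus PCA modes, here random placeholders) to 2D landmarks by minimising the reprojection error under a scaled-orthographic camera with a simple coefficient prior.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
n_pts, n_modes = 68, 10
mean = rng.normal(size=(n_pts, 3))            # placeholder mean 3D shape
basis = 0.1 * rng.normal(size=(n_modes, n_pts, 3))  # placeholder PCA modes

def reconstruct(alpha):
    # Shape = mean + linear combination of deformation modes.
    return mean + np.tensordot(alpha, basis, axes=1)

def residuals(p, obs2d, reg=0.1):
    alpha = p[:n_modes]
    R = Rotation.from_rotvec(p[n_modes:n_modes + 3]).as_matrix()
    s, t = p[n_modes + 3], p[n_modes + 4:]
    proj = s * reconstruct(alpha) @ R[:2].T + t   # scaled orthographic
    # Reprojection error plus a prior keeping coefficients plausible.
    return np.concatenate([(proj - obs2d).ravel(), reg * alpha])

# Synthetic "observed" landmarks from known parameters, then recovery:
alpha_true = rng.normal(size=n_modes)
R_true = Rotation.from_rotvec([0.1, 0.2, 0.0]).as_matrix()
obs2d = 1.5 * reconstruct(alpha_true) @ R_true[:2].T + np.array([10.0, 5.0])

x0 = np.zeros(n_modes + 6)
x0[n_modes + 3] = 1.0                         # identity pose, unit scale
fit = least_squares(residuals, x0, args=(obs2d,))
print("recovered scale:", round(fit.x[n_modes + 3], 3))  # ~1.5 (up to local minima)
```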

3D Fetal Face Reconstruction from Ultrasound Imaging
A. Alomar, A. Morales, K. Vellve, A.R. Porras, F. Crispi, M.G. Linguraru, G. Piella and F.M. Sukno
Proc. 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Vol. 4: VISAPP, pp. 615–624, 2021.
The fetal face contains essential information for the evaluation of congenital malformations and fetal brain function, as its development is driven by genetic factors at early stages of embryogenesis. Three-dimensional ultrasound (3DUS) can provide information about the facial morphology of the fetus, but its use for prenatal diagnosis is challenging due to imaging noise, fetal movements, limited field-of-view, low soft-tissue contrast, and occlusions. In this paper, we propose a fetal face reconstruction algorithm from 3DUS images based on a novel statistical morphable model of newborn faces, the BabyFM.

We test the feasibility of using newborn statistics to accurately reconstruct fetal faces by fitting the regularized morphable model to the noisy 3DUS images. The algorithm is capable of reconstructing the whole facial morphology of babies from one or several ultrasound scans in order to handle adverse conditions (e.g. missing parts, noisy data), and it has the potential to aid in-utero diagnosis of conditions that involve facial dysmorphology.
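A minimal sketch of the kind of regularized fitting described above, under simplifying assumptions: a toy model is fit directly to noisy, partially occluded 3D points with a robust (Huber) loss and an L2 prior on the coefficients. The model, noise level and occlusion pattern are placeholders, not the paper's setup.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
n_pts, n_modes = 200, 8
mean = rng.normal(size=(n_pts, 3))                    # placeholder mean shape
basis = 0.1 * rng.normal(size=(n_modes, n_pts, 3))    # placeholder modes

def shape(alpha):
    return mean + np.tensordot(alpha, basis, axes=1)

alpha_true = rng.normal(size=n_modes)
visible = rng.random(n_pts) < 0.7                     # simulate occlusions
scan = shape(alpha_true)[visible] + 0.02 * rng.normal(size=(visible.sum(), 3))

def residuals(alpha, reg=0.5):
    r = (shape(alpha)[visible] - scan).ravel()        # 3D point residuals
    return np.concatenate([r, reg * alpha])           # + coefficient prior

# Huber loss down-weights outliers typical of noisy ultrasound surfaces.
fit = least_squares(residuals, np.zeros(n_modes), loss="huber", f_scale=0.05)
print("max coefficient error:", round(np.abs(fit.x - alpha_true).max(), 3))
```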

An Enhanced Adversarial Network with Combined Latent Features for Spatio-Temporal Facial Affect Estimation in the Wild
D. Aspandi, F.M. Sukno, B. Schuller and X. Binefa
Proc. 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (Online & Streaming), Vol. 4: VISAPP, pp. 172–181, 2021.
Affective Computing has recently attracted the attention of the research community due to its numerous applications in diverse areas. In this context, the emergence of video-based data makes it possible to enrich the widely used spatial features with temporal information. However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data, making training difficult and time consuming. This paper addresses these shortcomings by proposing a novel model that efficiently extracts both spatial and temporal features of the data by means of its enhanced temporal modelling based on latent features. Our proposed model consists of three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting combined with curriculum learning to enable our adaptive attention modules. In our experiments, we show the effectiveness of our approach by reporting our competitive results on both the AFEW-VA and SEWA datasets, suggesting that temporal modelling improves the affect estimates in both qualitative and quantitative terms. Furthermore, we find that the inclusion of attention mechanisms leads to the highest accuracy improvements, as their weights seem to correlate well with the appearance of facial movements, both in terms of temporal localisation and intensity. Finally, we observe a sequence length of around 160 ms to be the optimum for temporal modelling, which is consistent with other relevant findings utilising similar lengths.

Spectral Correspondence Framework for Building a 3D Baby Face Model
A. Morales, A.R. Porras, L. Tu, M.G. Linguraru, G. Piella and F.M. Sukno
Proc. 15th IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Argentina, pp. 507–514, 2020.
Early detection of facial dysmorphology (variations from normal facial geometry) is essential for the timely detection of genetic conditions, which has a significant impact on reducing the mortality and morbidity associated with them. A model encoding the normal variability in the healthy population can serve as a reference to quantify the often subtle facial abnormalities that are present in young patients with such conditions.
In this paper, we present the first facial model constructed exclusively from newborn data, the Baby Face Model (BabyFM). Our model is built from 3D scans with an innovative pipeline based on least squares conformal maps (LSCMs). LSCMs are piece-wise linear mappings that project the training faces to a common 2D space while minimising the conformal distortion. This process improves the correspondences between 3D faces, which is particularly important for the identification of subtle dysmorphology. We evaluate the ability of our BabyFM to recover the baby's facial morphology from a set of 2D images by comparing it to state-of-the-art facial models. We also compare it to models built following an analogous pipeline to the one proposed in this paper but using non-rigid iterative closest point (NICP) to establish dense correspondences between the training faces. The results show that our model reconstructs the facial morphology of babies with significantly smaller errors than the state-of-the-art models (p < 10^-4) and the "NICP models" (p < 0.01).
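Once dense correspondence is established, the model-building step itself reduces to a PCA over registered surfaces. The sketch below shows only that final step (the LSCM correspondence pipeline is the paper's actual contribution and is not reproduced here); sizes and data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_scans, n_vertices = 40, 1000
# One row per registered scan, flattened to (x1, y1, z1, x2, ...).
X = rng.normal(size=(n_scans, 3 * n_vertices))

mean_face = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean_face, full_matrices=False)

k = 10                                    # retained deformation modes
modes = Vt[:k]                            # (k, 3*n_vertices) PCA basis
stdev = S[:k] / np.sqrt(n_scans - 1)      # per-mode standard deviation

# Synthesise a new face from k coefficients (+2 sd along mode 0):
coeffs = np.zeros(k)
coeffs[0] = 2.0
face = (mean_face + (coeffs * stdev) @ modes).reshape(n_vertices, 3)
print(face.shape)                         # (1000, 3)
```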

End-to-end facial and physiological model for Affective Computing and applications
J. Comas, D. Aspandi and X. Binefa
Proc. 15th IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Argentina, pp. 507–514, 2020.
In recent years, affective computing and its applications have become a fast-growing research topic. Furthermore, the rise of deep learning has introduced significant improvements in emotion recognition systems compared to classical methods. In this work, we propose a multi-modal emotion recognition model based on deep learning techniques using the combination of peripheral physiological signals and facial expressions. Moreover, we improve the proposed models by introducing latent features extracted from our internal Bio Auto-Encoder (BAE). Both models are trained and evaluated on the AMIGOS dataset, reporting valence, arousal, and emotion state classification. Finally, to demonstrate a possible medical application of affective computing using deep learning techniques, we apply the proposed method to the assessment of anxiety therapy. For this purpose, a reduced multi-modal database has been collected by recording facial expressions and peripheral signals, such as the electrocardiogram (ECG) and galvanic skin response (GSR), of each patient. Valence and arousal estimates were extracted using our proposed model across the duration of the therapy, successfully capturing the different emotional changes in the temporal domain.

CoGANs for Unsupervised Visual Speech Adaptation to New Speakers
A. Fernandez-Lopez, A. Karaali, N. Harte and F.M. Sukno
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6294–6298, 2020.
Audio-Visual Speech Recognition (AVSR) faces the difficult task of exploiting acoustic and visual cues simultaneously. Augmenting speech with the visual channel creates its own challenges, e.g. every person has unique mouth movements, making the generalization of visual models very difficult. This factor motivates our focus on the generalization of speaker-independent (SI) AVSR systems, especially in noisy environments, by exploiting the visual domain. Specifically, we are the first to explore the visual adaptation of an SI-AVSR system to an unknown and unlabelled speaker. We adapt an AVSR system trained in a source domain to decode samples in a target domain without the need for labels in the target domain. For the domain adaptation to the unknown speaker, we use Coupled Generative Adversarial Networks to automatically learn a joint distribution of multi-domain images. We evaluate our character-based AVSR system on the TCD-TIMIT dataset and obtain up to a 10% average improvement with respect to the equivalent AVSR system.
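A minimal sketch of the CoGAN weight-sharing idea referenced above: two generators driven by the same latent code share their early layers and differ only in their final, domain-specific layers, so corresponding outputs form samples from a joint distribution over the two domains. Dimensions and layers are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

shared = nn.Sequential(               # shared high-level decoding layers
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
)
head_src = nn.Linear(512, 32 * 32)    # source-domain rendering layer
head_tgt = nn.Linear(512, 32 * 32)    # target-domain rendering layer

z = torch.randn(8, 64)                # one latent code per sample pair
h = shared(z)
img_src = torch.tanh(head_src(h)).view(-1, 1, 32, 32)
img_tgt = torch.tanh(head_tgt(h)).view(-1, 1, 32, 32)
# Each generator is trained against its own discriminator; because the
# latent code and early layers are shared, corresponding images form a
# sample from a joint distribution over the two domains, without paired data.
print(img_src.shape, img_tgt.shape)
```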

Refining the resolution of craniofacial dysmorphology in bipolar disorder as an index of brain dysmorphogenesis
S. Katina, B.D. Kelly, M.A. Rojas, F.M. Sukno, A. McDermott, R.J. Hennessy, A. Lane, P.F. Whelan, A.W. Bowman and J.L. Waddington
Psychiatry Research, 291: 113243, 2020.
As understanding of the genetics of bipolar disorder increases, controversy endures regarding whether the origins of this illness include early maldevelopment. Clarification would be facilitated by a 'hard' biological index of fetal developmental abnormality, among which craniofacial dysmorphology bears the closest embryological relationship to brain dysmorphogenesis. Therefore, 3D laser surface imaging was used to capture the facial surface of 21 patients with bipolar disorder and 45 control subjects; 21 patients with schizophrenia were also studied.

Surface images were subjected to geometric morphometric analysis in non-affine space for more incisive resolution of subtle, localised dysmorphologies that might distinguish patients from controls. Complex and more biologically informative, non-linear changes distinguished bipolar patients from control subjects. On a background of minor dysmorphology of the upper face, maxilla, midface and periorbital regions, bipolar disorder was characterised primarily by the following dysmorphologies: (a) retrusion and shortening of the premaxilla, nose, philtrum, lips and mouth (the frontonasal prominences), with (b) some protrusion and widening of the mandible-chin. The topography of facial dysmorphology in bipolar disorder indicates disruption to early development in the frontonasal process and, on embryological grounds, cerebral dysmorphogenesis in the forebrain, most likely between the 10th and 15th weeks of fetal life.
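As background for readers unfamiliar with geometric morphometrics, the sketch below shows the standard first step: Procrustes superimposition of landmark configurations, after which per-landmark displacement fields can be compared between groups. This is a generic illustration on random data, not the study's full non-affine analysis.

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
reference = rng.normal(size=(30, 3))       # 30 anatomical landmarks
subject = reference + 0.05 * rng.normal(size=(30, 3))  # subtle differences

# Superimpose: position, scale and rotation are removed;
# `disparity` is the residual sum of squared differences.
ref_std, subj_std, disparity = procrustes(reference, subject)

# Per-landmark displacement field after superimposition: the signal
# that group-level shape statistics probe.
displacement = np.linalg.norm(subj_std - ref_std, axis=1)
print(round(disparity, 4), round(displacement.mean(), 4))
```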

Tensor Decomposition and Non-linear Manifold Modeling for 3D Head Pose Estimation
D. Derkach, A. Ruiz and F.M. Sukno
International Journal of Computer Vision, 127(10): 1565–1585, 2019.
Head pose estimation is a challenging computer vision problem with important applications in different scenarios such as human-computer interaction or face recognition. In this paper, we present a 3D head pose estimation algorithm based on non-linear manifold learning. A key feature of the proposed approach is that it allows modeling the underlying 3D manifold that results from the combination of rotation angles. To do so, we use tensor decomposition to generate separate subspaces for each variation factor and show that each of them has a clear structure that can be modeled with cosine functions of a unique shared parameter per angle. Such a representation provides a deep understanding of data behavior. We show that the proposed framework can be applied to a wide variety of input features and can be used for different purposes. Firstly, we test our system on a publicly available database consisting of 2D images and show that the cosine functions can be used to synthesize rotated versions of an object from which we see only a 2D image at a specific angle. Further, we perform 3D head pose estimation experiments using two other types of features: automatic landmarks and histogram-based 3D descriptors. We evaluate our approach on two publicly available databases, and demonstrate that angle estimations can be performed by optimizing the combination of these cosine functions to achieve state-of-the-art performance.
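The sketch below illustrates the core observation on a toy example, using a plain SVD as the single-factor case of the tensor decomposition: when features vary with one rotation angle, their subspace coordinates follow cosine curves of a shared parameter, and an unseen sample's angle can be recovered by fitting that parameter. Data and dimensions are synthetic.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
angles = np.deg2rad(np.arange(-60, 61, 5))        # training yaw angles
w1, w2, b = rng.normal(size=(3, 50))              # random feature directions
# Synthetic features whose variation is driven purely by the rotation:
F = np.outer(np.cos(angles), w1) + np.outer(np.sin(angles), w2) + b

# Angle-mode subspace via SVD; U[:, 0] and U[:, 1] trace cosine curves
# of the same angle parameter, with different amplitude and phase.
U, S, Vt = np.linalg.svd(F - F.mean(0), full_matrices=False)

def subspace_coords(f):
    return (f - F.mean(0)) @ Vt[:2].T / S[:2]

def model_coords(a):
    return subspace_coords(np.cos(a) * w1 + np.sin(a) * w2 + b)

# Estimate the angle of an unseen (noisy) feature vector:
test_angle = np.deg2rad(23.0)
f_test = np.cos(test_angle) * w1 + np.sin(test_angle) * w2 + b \
         + 0.01 * rng.normal(size=50)
target = subspace_coords(f_test)
est = minimize_scalar(lambda a: np.sum((model_coords(a) - target) ** 2),
                      bounds=(-np.pi / 2, np.pi / 2), method="bounded")
print(round(np.rad2deg(est.x), 1))                # ~23 degrees
```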

Figure: Tensor decomposition of multi-view data yields manifold subspaces whose components follow trigonometric curves.

Three-Dimensional Face Reconstruction from Uncalibrated Photographs: Application to Early Detection of Genetic Syndromes
L. Tu, A.R. Porras, A. Morales, D.A. Perez, G. Piella, F.M. Sukno and M.G. Linguraru
Proc. 8th MICCAI Clinical Image-based Procedures Workshop, Shenzhen, China, pp. 182–189, 2019.
Facial analysis from photography supports the early
identification of genetic syndromes, but clinically-acquired uncalibrated
images suffer from image pose and illumination variability. Although 3D
photography overcomes some of the challenges of 2D images, 3D scanners are
not typically available. We present an optimization method for 3D face
reconstruction from uncalibrated 2D photographs of the face using a novel
statistical shape model of the infant face. First, our method creates an
initial estimation of the camera pose for each 2D photograph using the
average shape of the statistical model and a set of 2D facial landmarks.
Second, it calculates the camera pose and the parameters of the statistical
model by minimizing the distance between the projection of the estimated 3D
face in the image plane of each camera and the observed 2D face geometry.
Using the reconstructed 3D faces, we automatically extract a set of 3D
geometric and appearance descriptors and we use them to train a classifier to
identify facial dysmorphology associated with genetic syndromes. We evaluated
our face reconstruction method on 3D photographs of 54 subjects (age range
0–3 years), and we obtained a point-to-surface error of 2.01 ± 0.54%, which
was a significant improvement over 2.98 ± 0.64% using state-of-the-art
methods (p < 0.001). Our classifier detected genetic syndromes
from the reconstructed 3D faces from the 2D photographs with 100% sensitivity
and 92.11% specificity.

Robust facial alignment with internal denoising auto-encoder
D. Aspandi, O. Martinez, F.M. Sukno and X. Binefa
Proc. 16th Conference on Computer & Robot Vision, Ontario, Canada, pp. 143–150, 2019.
The development of facial alignment models is growing rapidly thanks to the availability of large facial landmark datasets and powerful deep learning models. However, important challenges still remain for facial alignment models to work on images under extreme conditions, such as severe occlusions or large variations in pose and illumination. Current attempts to overcome this limitation have mainly focused on building robust feature extractors, with the assumption that the model will be able to discard the noise and select only the meaningful features. However, such an assumption ignores the importance of understanding the noise that characterizes unconstrained images, which has been shown to benefit computer vision models if used appropriately in the learning strategy. Thus, in this paper we investigate the introduction of specialized modules for noise detection and removal, in combination with our state-of-the-art facial alignment module, and show that this leads to improved robustness both to synthesized noise and to in-the-wild conditions. The proposed model is built by combining two major sub-networks: an internal image denoiser (based on the auto-encoder architecture) and a facial landmark localiser (based on the Inception-ResNet architecture). Our results on the 300-W and Menpo datasets show that our model can effectively handle different types of synthetic noise, which also leads to enhanced robustness in real-world unconstrained settings, reaching top state-of-the-art accuracy.
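A minimal sketch of the internal-denoiser training strategy described above: a small convolutional auto-encoder is trained to undo synthetic corruptions (noise plus a crude occlusion), so it can suppress similar artifacts before landmark localisation. The architecture, noise model and data are illustrative placeholders.

```python
import torch
import torch.nn as nn

denoiser = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

clean = torch.rand(16, 3, 64, 64)             # placeholder face crops
for step in range(100):
    # Corrupt with Gaussian noise and a random occluding patch.
    noisy = clean + 0.1 * torch.randn_like(clean)
    noisy[:, :, 20:36, 20:36] = 0.0           # crude synthetic occlusion
    loss = nn.functional.mse_loss(denoiser(noisy), clean)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```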

Lip-Reading with Limited-Data Network
A. Fernandez-Lopez and F.M. Sukno
Proc. 27th European Signal Processing Conference, A Coruña, Spain, 2019.
The development of Automatic Lip-Reading (ALR) systems is currently dominated by Deep Learning (DL) approaches. However, DL systems generally face two main issues, related to the amount of data and the complexity of the model. To find a balance between the amount of available training data and the number of parameters of the model, in this work we introduce an end-to-end ALR system that combines CNNs and LSTMs and can be trained without large-scale databases. To this end, we propose to split the training by modules, automatically generating weak labels per frame, termed visual units. These weak visual units are representative enough to guide the CNN to extract meaningful features that, when combined with the context provided by the temporal module, are sufficiently informative to train an ALR system in a very short time and with no need for manual labeling. The system is evaluated on the well-known OuluVS2 database for sentence-level classification. We obtain an accuracy of 91.38%, which is comparable to state-of-the-art results but, differently from most previous approaches, we do not require the use of external training data.
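A minimal sketch of the weak-label idea, under the assumption that frame descriptors are already available: clustering the descriptors yields per-frame "visual units" whose indices serve as weak targets to pre-train the CNN front-end without manual labelling. The descriptors and cluster count below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 128))       # one descriptor per frame

kmeans = KMeans(n_clusters=40, n_init=10, random_state=0).fit(frames)
weak_labels = kmeans.labels_                # per-frame "visual units"
print(np.bincount(weak_labels)[:10])
# The CNN is then trained to predict `weak_labels` from raw frames;
# the temporal module (LSTM) later consumes its features in sequence.
```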

Fully end-to-end composite recurrent convolution network for deformable facial tracking in the wild
D. Aspandi, O. Martinez, F.M. Sukno and X. Binefa
Proc. 14th International Conference on Automatic Face & Gesture Recognition, Lille, France, 2019.
Human facial tracking is an important task in computer vision, which has recently lost pace compared to other facial analysis tasks. The majority of currently available trackers have two major limitations: they make little use of temporal information and they rely on handcrafted features, without taking full advantage of the large annotated datasets that have recently become available. In this paper we present a fully end-to-end facial tracking model based on current state-of-the-art deep model architectures that can be effectively trained from the available annotated facial landmark datasets. We build our model from the recently introduced general object tracker Re3, which allows modeling the short- and long-term temporal dependency between frames by means of its internal Long Short-Term Memory (LSTM) layers. Facial tracking experiments on the challenging 300-VW dataset show that our model can produce state-of-the-art accuracy and far lower failure rates than competing approaches. We specifically compare the performance of our approach modified to work in tracking-by-detection mode and show that, as such, it can produce results that are comparable to state-of-the-art trackers. However, upon activation of our tracking mechanism, the results improve significantly, confirming the advantage of taking into account temporal dependencies.

Heatmap-guided balanced deep convolution networks for family classification in the wild
D. Aspandi, O. Martinez and X. Binefa
Proc. 14th International Conference on Automatic Face & Gesture Recognition, Lille, France, 2019.
Automatic kinship recognition using computer vision, which aims to infer the blood relationship between individuals by comparing only their facial features, has started to gain attention recently. The introduction of large kinship datasets, such as Families In The Wild (FIW), has enabled large-scale dataset modeling using state-of-the-art deep learning models. Among kinship recognition tasks, the family classification task has been lacking significant progress because its difficulty increases with the number of family members. Furthermore, most current state-of-the-art approaches do not perform any data pre-processing (which could improve model accuracy) and are trained without a regularizer (which results in models susceptible to overfitting). In this paper, we present the Deep Family Classifier (DFC), a deep learning model for family classification in the wild. We build our model by combining two sub-networks: an internal Image Feature Enhancer, which operates by removing image noise and provides an additional facial heatmap layer, and a Family Class Estimator, trained with strong regularizers and a compound loss. We observe progressive improvement in accuracy during the validation phase, with state-of-the-art results of 16.89% for track 2 of the RFIW2019 challenge and 17.08% for the family classification task on the FIW dataset.

Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish
A. Fernandez-Lopez and F.M. Sukno
Computer Vision, Imaging and Computer Graphics - Theory and Applications, Communications in Computer and Information Science book series, Vol. 983, pp. 305–328, 2019.
Speech is the most used communication method between humans and is considered a multisensory process. Even though there is a popular belief that speech is something that we hear, there is overwhelming evidence that the brain treats speech as something that we both hear and see. Much of the research has focused on Automatic Speech Recognition (ASR) systems, treating speech primarily as an acoustic form of communication. In recent years, there has been an increasing interest in systems for Automatic Lip-Reading (ALR), although exploiting the visual information has proved to be challenging. One of the main problems in ALR is how to make the system robust to the visual ambiguities that appear at the word level. These ambiguities make the definition of the minimum distinguishable unit of the video domain confusing and imprecise. In contrast to the audio domain, where the phoneme is the standard minimum auditory unit, there is no consensus on the definition of the minimum visual unit (the viseme). In this work, we focus on the automatic construction of a phoneme-to-viseme mapping based on visual similarities between phonemes to maximize word recognition. We investigate the usefulness of different phoneme-to-viseme mappings, obtaining the best results for intermediate vocabulary lengths. We construct an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences. We test our system on two Spanish corpora with continuous speech (AV@CAR and VLRF), containing 19 and 24 speakers, respectively. Our results indicate that we are able to recognize 47% (resp. 51%) of the phonemes and 23% (resp. 21%) of the words for AV@CAR and VLRF. We also show additional results that support the usefulness of visemes. Experiments on a comparable ALR system trained exclusively using phonemes at all its stages confirm the existence of strong visual ambiguities between groups of phonemes. This fact, and the higher word accuracy obtained when using phoneme-to-viseme mappings, justify the usefulness of visemes instead of the direct use of phonemes for ALR.
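As an illustration of how such a mapping can be derived (a generic sketch, not necessarily the paper's exact procedure), phonemes that a visual classifier frequently confuses can be merged into shared visemes by hierarchical clustering of a symmetrised confusion matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n_phonemes = 24
# Placeholder confusion matrix; in practice it comes from evaluating a
# phoneme-level ALR system on held-out data.
C = rng.random((n_phonemes, n_phonemes))
C = (C + C.T) / 2                        # symmetrise confusions
np.fill_diagonal(C, 1.0)

dist = 1.0 - C                           # more confusion => smaller distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

n_visemes = 10                           # intermediate vocabulary size
viseme_of = fcluster(Z, t=n_visemes, criterion="maxclust")
print(viseme_of)                         # phoneme index -> viseme id
```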

3D head pose estimation using tensor decomposition and non-linear manifold modeling
D. Derkach, A. Ruiz and F.M. Sukno
Proc. International Conference on 3D Vision, Verona, Italy, pp. 505–513, 2018.
Head pose estimation is a challenging computer vision problem with important applications in different scenarios such as human-computer interaction or face recognition. In this paper, we present an algorithm for 3D head pose estimation using only depth information from Kinect sensors. A key feature of the proposed approach is that it allows modeling the underlying 3D manifold that results from the combination of pitch, yaw and roll variations. To do so, we use tensor decomposition to generate separate subspaces for each variation factor and show that each of them has a clear structure that can be modeled with cosine functions of a unique shared parameter per angle. Such a representation provides a deep understanding of data behavior, and angle estimations can be performed by optimizing the combination of these cosine functions. We evaluate our approach on two publicly available databases, and achieve top state-of-the-art performance.

Multi-instance dynamic ordinal random fields for weakly supervised facial behavior analysis
A. Ruiz, O. Rudovic, X. Binefa and M. Pantic
IEEE Transactions on Image Processing, 27(8): 3969–3982, 2018.
We propose a multi-instance learning (MIL) approach for weakly supervised learning problems, where a training set is formed by bags (sets of feature vectors, or instances) and only labels at bag level are provided. Specifically, we consider the multi-instance dynamic ordinal regression (MI-DOR) setting, where the instance labels are naturally represented as ordinal variables and bags are structured as temporal sequences. To this end, we propose multi-instance dynamic ordinal random fields (MI-DORF). In this framework, we treat instance labels as temporally dependent latent variables in an undirected graphical model. Different MIL assumptions are modelled via newly introduced high-order potentials relating bag and instance labels within the energy function of the model. We also extend our framework to address the partially observed MI-DOR problem, where a subset of instance labels is available during training. We show, on the tasks of weakly supervised facial action unit and pain intensity estimation, that the proposed framework outperforms alternative learning approaches. Furthermore, we show that MI-DORF can be employed to largely reduce the data annotation effort in this context.
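For intuition, a common MIL assumption of the kind MI-DORF encodes as high-order potentials is that the observed ordinal bag label equals the maximum of the latent per-frame labels; a toy illustration follows.

```python
import numpy as np

instance_labels = np.array([0, 0, 1, 2, 3, 3, 1, 0])  # latent, per frame
bag_label = instance_labels.max()                     # observed, per sequence
assert bag_label == 3
# Learning inverts this: given only bag labels, infer per-frame ordinal
# labels consistent with the max-assumption and their temporal dynamics.
```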

Automatic local shape spectrum analysis for 3D facial expression recognition
D. Derkach and F.M. Sukno
Image and Vision Computing, 79: 86–98, 2018.
We investigate the problem of Facial Expression Recognition (FER) using 3D data. Building from one of the most successful frameworks for facial analysis using exclusively 3D geometry, we extend the analysis from a curve-based representation to a spectral representation, which allows a complete description of the underlying surface that can be further tuned to the desired level of detail. Spectral representations are based on the decomposition of the geometry into its spatial frequency components, much like a Fourier transform, which are related to intrinsic characteristics of the surface. In this work, we propose the use of Graph Laplacian Features (GLFs), which result from the projection of local surface patches onto a common basis obtained from the Graph Laplacian eigenspace. We extract patches around facial landmarks and include a state-of-the-art localization algorithm to allow for fully automatic operation. The proposed approach is tested on the three most popular databases for 3D FER (BU-3DFE, Bosphorus and BU-4DFE) in terms of expression and AU recognition. Our results show that the proposed GLFs consistently outperform the curve-based approach as well as the most popular alternative for spectral representation, Shape-DNA, which is based on the Laplace-Beltrami operator and cannot provide a stable basis that guarantees that the extracted signatures for the different patches are directly comparable. Interestingly, the accuracy improvement brought by GLFs is obtained at a lower computational cost. Considering the extraction of patches as a common step between the three compared approaches, the curve-based framework requires a costly elastic deformation between corresponding curves (e.g. based on splines) and Shape-DNA requires computing an eigen-decomposition of every new patch to be analyzed. In contrast, GLFs only require the projection of the patch geometry onto the Graph Laplacian eigenspace, which is common to all patches and can therefore be pre-computed off-line. We also show that 14 automatically detected landmarks are enough to achieve high FER and AU detection rates, only slightly below those obtained when using sets of manually annotated landmarks.
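A minimal sketch of the GLF construction under simplifying assumptions (a 4-connected grid stands in for the resampled patch topology): because all patches share one graph, a single Laplacian eigenbasis is precomputed, and each patch is described by projecting its geometry onto it, making signatures directly comparable across patches.

```python
import numpy as np

n = 8                                     # patch resampled to an 8x8 grid
N = n * n
# Adjacency of a 4-connected grid (the topology shared by all patches):
A = np.zeros((N, N))
for i in range(n):
    for j in range(n):
        k = i * n + j
        if i + 1 < n:
            A[k, k + n] = A[k + n, k] = 1
        if j + 1 < n:
            A[k, k + 1] = A[k + 1, k] = 1
L = np.diag(A.sum(1)) - A                 # graph Laplacian

# Common spectral basis, computed once and reused for every patch:
eigvals, eigvecs = np.linalg.eigh(L)
basis = eigvecs[:, :20]                   # 20 low-frequency components

patch_depth = np.random.default_rng(0).normal(size=N)  # placeholder geometry
glf = basis.T @ patch_depth               # 20-D spectral signature
print(glf.shape)                          # (20,)
```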

Manual AU annotations for high-intensity expressions in the BU-3DFE database

Survey on Automatic Lip-Reading in the Era of Deep Learning
A. Fernandez-Lopez and F.M. Sukno
Image and Vision Computing, 78: 53–72, 2018.
In the last few years, there has been an increasing interest in developing systems for Automatic Lip-Reading (ALR). Similarly to other computer vision applications, methods based on Deep Learning (DL) have become very popular and have made it possible to substantially push forward the achievable performance. In this survey, we review ALR research during the last decade, highlighting the progression from approaches previous to DL (which we refer to as traditional) toward end-to-end DL architectures. We provide a comprehensive list of the audio-visual databases available for lip-reading, describing what tasks they can be used for, their popularity and their most important characteristics, such as the number of speakers, vocabulary size, recording settings and total duration. In correspondence with the shift toward DL, we show that there is a clear tendency toward large-scale datasets targeting realistic application settings and large numbers of samples per class. On the other hand, we summarize, discuss and compare the different ALR systems proposed in the last decade, separately considering traditional and DL approaches. We provide a quantitative analysis of the different systems by organizing them in terms of the task that they target (e.g. recognition of letters or digits and words or sentences) and comparing their reported performance on the most commonly used datasets. As a result, we find that DL architectures perform similarly to traditional ones for simpler tasks but report significant improvements in more complex tasks, such as word or sentence recognition, with up to 40% improvement in word recognition rates. Hence, we provide a detailed description of the available ALR systems based on end-to-end DL architectures and identify a tendency to focus on the modeling of temporal context as the key to advancing the field. Such modeling is dominated by recurrent neural networks due to their ability to retain context at multiple scales (e.g. short- and long-term information). In this sense, current efforts tend toward techniques that allow a more comprehensive modeling and interpretability of the retained context.

A quantitative comparison of methods for 3D face reconstruction from 2D images
A. Morales, G. Piella, O. Martinez and F.M. Sukno
Proc. IEEE International Conference on Automatic Face & Gesture Recognition, Xi'an, China, 2018.
In the past years, many studies have highlighted the relation between deviations from normal facial morphology (dysmorphology) and some genetic and mental disorders. Recent advances in methods for reconstructing the 3D geometry of the face from 2D images open new possibilities for dysmorphology research without the need for specialized 3D imaging equipment. However, it is unclear whether these methods can reconstruct the facial geometry with the required accuracy.

In this paper we present a comparative study of some of the most relevant approaches for 3D face reconstruction from 2D images, including photometric stereo, deep learning and 3D Morphable Model fitting. We address the comparison in qualitative and quantitative terms using a public database consisting of 2D images and 3D scans from 100 people. Interestingly, we find that some methods produce quite noisy reconstructions that do not seem realistic, whereas others look more natural. However, the latter do not seem to adequately capture the geometric variability that exists between different subjects and produce reconstructions that look very similar across individuals, thus questioning their fidelity.
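For reference, a typical quantitative criterion in such comparisons is the mean distance from reconstructed vertices to the ground-truth scan; the sketch below uses a nearest-neighbour (point-to-point) surrogate for point-to-surface distance on random placeholder point sets.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
ground_truth = rng.normal(size=(5000, 3))           # scan vertices
reconstruction = ground_truth[:2000] + 0.01 * rng.normal(size=(2000, 3))

# Distance from each reconstructed vertex to its nearest scan point.
dists, _ = cKDTree(ground_truth).query(reconstruction)
print("mean error:", round(dists.mean(), 4))
```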

The research activities in this project have contributed directly or indirectly to the development of undergraduate and graduate students, as briefly summarized below:
María de Araceli Morales, Advanced tools for facial analysis: application to newborns. PhD Thesis, Universitat Pompeu Fabra, Departament de Tecnologies de la Informació i les Comunicacions (in progress, 2018–2022). Supervisors: F. Sukno and G. Piella.
Decky Aspandi Latif, Deep Spatio-Temporal Neural Network for Facial Analysis. PhD Thesis, Universitat Pompeu Fabra, Departament de Tecnologies de la Informació i les Comunicacions (05-03-2021). Supervisor: X. Binefa.
Adriana Fernandez-López, Learning of Meaningful Visual Representations for Continuous Lip-Reading. PhD Thesis, Universitat Pompeu Fabra, Departament de Tecnologies de la Informació i les Comunicacions (04-03-2021). Supervisor: F. Sukno.
Dmytro Derkach, Spectrum analysis methods for 3D facial expression recognition and head pose estimation. PhD Thesis, Universitat Pompeu Fabra, Departament de Tecnologies de la Informació i les Comunicacions (03-12-2018). Supervisor: F. Sukno.
Nuria Rodriguez Díaz, Lie Detection based on Deep Learning applied to a collected and annotated dataset. Bachelor Thesis, Double Degree in Audiovisual Systems and Computer Engineering, Universitat Pompeu Fabra (2020). Supervisor: X. Binefa.
Antonia Alomar Adrover, 3D fetal face reconstruction from ultrasound imaging. Bachelor Thesis, Biomedical Engineering, Universitat Pompeu Fabra (2020). Supervisors: F. Sukno, G. Piella and A. Morales.
Mar Ferrer Ferrer, Multimodal fusion of video signals for remote evaluation of emotional/cognitive processing. Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2020). Supervisors: F. Sukno and A. Pereda.
Lie Jin Wang, Emotion Control for E-Learning (original title: Control d'emocions per E-Learning). Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2020). Supervisor: X. Binefa.
Nadia Cosor Oltra, Anomaly Detection from a Ranking with Unsupervised Machine Learning Methods (original title: Detecció d'Anomalies a Partir d'un Rànquing amb Mètodes de Machine Learning No Supervisats). Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2020). Supervisor: X. Binefa.
Joaquim Comas Martinez, Multimodal emotion recognition based on facial and physiological signals and its application in affective computing. Master Thesis, Joint Master in Computer Vision, Universitat Autònoma de Barcelona (2019). Supervisor: X. Binefa.
Gary Stefano Ulloa Rodríguez, Deep affective computing: automatic recognition of human emotion using facial features. Bachelor Thesis, Computer Engineering, Universitat Pompeu Fabra (2019). Supervisors: O. Martinez and X. Binefa.
Gemma Alaix i Granell, Multimodal analysis for automatic affect recognition. Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2019). Supervisor: X. Binefa.
Sergi Solà Casas, Attentional Mechanism for Affective Computing. Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2019). Supervisor: X. Binefa.
Guillem Garcia Gómez, Heart rate variability measurement from facial videos to detect stress. Bachelor Thesis, Biomedical Engineering, Universitat Pompeu Fabra (2018). Supervisor: F. Sukno.
Joaquim Comas Martinez, Study and analysis of cardiac signals and their relation to emotions (original title: Estudi i anàlisi de senyals cardíaques i la seva relació amb les emocions). Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2018). Supervisor: X. Binefa.
Paula Catalan Rabaneda, Lip-Reading Visual Passwords for User Authentication. Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2018). Supervisors: F. Sukno and A. Fernandez.

Acknowledgements
The Principal Investigators, Prof. Xavier Binefa & Dr. Federico Sukno, would like to thank all those involved in the activities listed above, as well as the Ministry of Economy, Industry and Competitiveness, which funded this project through grant TIN2017-90124-P.