UNFACE Project

UNFACE: Fine-grained facial analysis for unmasking hidden information

The human face is a fundamental source of information for understanding the behavior of individuals. Traditionally, computer vision has exploited it for the recognition of identity and expressions, but it has recently been suggested that the information that can be extracted from the face goes well beyond this and can be indicative of deception, heart rate, psychological states or even psychiatric disorders such as autism or depression. Some of this information, however, might not be apparent or might even be hidden from us, and could only be recovered by means of specialized techniques. An iconic example is the detection of heart rate by amplifying the subtle color changes of the face due to blood flow, which are invisible to the human eye.
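
To make the color-amplification example concrete, the sketch below estimates heart rate from the per-frame mean skin color of a face video by band-pass filtering and spectral peak picking. It is a minimal illustration of the principle, not the magnification method itself; the synthetic trace and all parameters are assumptions.

```python
# Minimal sketch: estimate heart rate from the mean skin-color trace of a
# face video by band-pass filtering and locating the dominant spectral peak.
# The synthetic trace below stands in for per-frame mean green values.
import numpy as np
from scipy.signal import butter, filtfilt

fps = 30.0                                   # video frame rate (Hz)
t = np.arange(0, 20, 1.0 / fps)              # 20 s of "video"
pulse_hz = 1.2                               # ground truth: 72 bpm
trace = 0.05 * np.sin(2 * np.pi * pulse_hz * t) + np.random.randn(t.size) * 0.1

# Band-pass to the plausible heart-rate band (0.7-4 Hz, i.e. 42-240 bpm).
b, a = butter(3, [0.7 / (fps / 2), 4.0 / (fps / 2)], btype="band")
filtered = filtfilt(b, a, trace)

# The dominant frequency of the filtered trace is the heart-rate estimate.
spectrum = np.abs(np.fft.rfft(filtered))
freqs = np.fft.rfftfreq(filtered.size, d=1.0 / fps)
bpm = 60.0 * freqs[np.argmax(spectrum)]
print(f"estimated heart rate: {bpm:.1f} bpm")
```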

The goal of the UNFACE project has been to address fine-grained facial analysis to unmask different sources of information hidden in the face. The project has delivered research results both in fundamental facial analysis algorithms (e.g. landmark localization and tracking, facial expressions, head pose estimation, and facial surface reconstruction) and in a few selected application areas that demonstrate the practical relevance of the developed methods (e.g. affective computing, automatic lip reading, dysmorphology analysis and deception detection).

Among the achievements of the UNFACE project, we highlight: 1) the design of advanced deep learning architectures for accurate and robust tracking of facial landmarks under realistic (in-the-wild) scenarios, whose resulting models have been made publicly available; 2) the use of spectral decomposition methods to improve the accuracy of facial expression analysis in 3D, as well as to improve dense surface correspondences for 3D facial reconstruction; 3) the creation of the first 3D baby face model, built exclusively from infant facial surfaces with an innovative pipeline based on the aforementioned spectral correspondences, which can also derive the model template automatically instead of requiring a pre-existing one, as other state-of-the-art methods commonly do; 4) the development of a database for lie detection based on a competitive game scenario that promotes the frequent and motivated use of lies by the participants, recorded with multiple cameras that provide both 2D and 3D information of the participants' faces; 5) the development of data-driven representations that make continuous lip reading possible in Spanish, and potentially in other languages, without the need to replicate the huge data resources needed to train other state-of-the-art lip reading systems, which in practice constrain their applicability to English.

Principal Investigators: Xavier Binefa & Federico Sukno

This project was funded through the 2017 call of the “Programa Estatal de Fomento de la Investigación Científica y Técnica de Excelencia” of the Spanish Ministry of Economy, Industry and Competitiveness.

 

Index of Project Results:

1. Composite recurrent network with internal denoising for facial alignment in still and video images in the wild, Image and Vision Computing, 2021.

2. Survey on 3D face reconstruction from uncalibrated images, Computer Science Review, 2021.

3. 3D Fetal Face Reconstruction from Ultrasound Imaging, GRAPP 2021.

4. An Enhanced Adversarial Network with Combined Latent Features for Spatio-Temporal Facial Affect Estimation in the Wild, VISAPP 2021.

5. Spectral Correspondence Framework for Building a 3D Baby Face Model, FG 2020.

6. End-to-end facial and physiological model for Affective Computing and applications, FG 2020.

7. Refining the resolution of craniofacial dysmorphology in bipolar disorder as an index of brain dysmorphogenesis, Psychiatry Research, 291: 113243, 2020.

8. CoGANs for Unsupervised Visual Speech Adaptation to New Speakers, ICASSP 2020.

9. Tensor Decomposition and Non-linear Manifold Modeling for 3D Head Pose Estimation, International Journal of Computer Vision, 127(10): 1565–1585, 2019.

10. Three-Dimensional Face Reconstruction from Uncalibrated Photographs: Application to Early Detection of Genetic Syndromes, MICCAI CLIP 2019.

11. Robust facial alignment with internal denoising auto-encoder, CRV 2019.

12. Lip-Reading with Limited-Data Network, EUSIPCO 2019.

13. Fully end-to-end composite recurrent convolution network for deformable facial tracking in the wild, FG 2019.

14. Heatmap-guided balanced deep convolution networks for family classification in the wild, FG 2019.

15. Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish, Communications in Computer and Information Science, 2019.

16. Multi-instance dynamic ordinal random fields for weakly supervised facial behavior analysis, IEEE Transactions on Image Processing, 27(8): 3969–3982, 2018.

17. Automatic local shape spectrum analysis for 3D facial expression recognition, Image and Vision Computing, 79: 86–98, 2018.

18. 3D head pose estimation using tensor decomposition and non-linear manifold modeling, 3DV 2018.

19. Survey on Automatic Lip-Reading in the Era of Deep Learning, Image and Vision Computing, 78: 53–72, 2018.

20. A quantitative comparison of methods for 3D face reconstruction from 2D images, FG 2018.

 

Further activities related to the project

· 4 PhD Theses

· 12 Final bachelor/master projects

 

Composite recurrent network with internal denoising for facial alignment in still and video images in the wild


D. Aspandi, O. Martinez, F.M. Sukno and X. Binefa

Image and Vision Computing, 111(7): 104189, 2021.

Facial alignment is an essential task for many higher-level facial analysis applications, such as animation, human activity recognition and human-computer interaction. Although the recent availability of big datasets and powerful deep-learning approaches has enabled major improvements in state-of-the-art accuracy, the performance of current approaches can severely deteriorate when dealing with images in highly unconstrained conditions, which limits the real-life applicability of such models. In this paper, we propose a composite recurrent tracker with internal denoising that jointly addresses both single-image facial alignment and deformable facial tracking in the wild. Specifically, we incorporate multilayer LSTMs to model temporal dependencies with variable length and introduce an internal denoiser which selectively enhances the input images to improve the robustness of our overall model. We achieve this by combining 4 different sub-networks that specialize in each of the key tasks that are required, namely face detection, bounding-box tracking, facial region validation and facial alignment with internal denoising. These blocks are endowed with novel algorithms, resulting in a facial tracker that is accurate, robust to in-the-wild settings and resilient against drifting. We demonstrate this by testing our model on the 300-W and Menpo datasets for single-image facial alignment, and on the 300-VW dataset for deformable facial tracking. Comparison against 20 other state-of-the-art methods demonstrates the excellent performance of the proposed approach.
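
A minimal sketch of the temporal-modelling idea (not the paper's exact four-sub-network architecture described above): a multilayer LSTM that refines noisy per-frame landmark estimates using temporal context.

```python
# Illustrative sketch: a multilayer LSTM refines noisy per-frame landmark
# estimates using temporal context, the core idea behind recurrent tracking.
import torch
import torch.nn as nn

class RecurrentLandmarkRefiner(nn.Module):
    def __init__(self, n_landmarks=68, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_landmarks * 2, hidden_size=hidden,
                            num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)

    def forward(self, per_frame_landmarks):
        # per_frame_landmarks: (batch, time, n_landmarks * 2) raw detections
        context, _ = self.lstm(per_frame_landmarks)
        # Predict a residual correction per frame from the temporal context.
        return per_frame_landmarks + self.head(context)

model = RecurrentLandmarkRefiner()
noisy_track = torch.randn(4, 30, 68 * 2)     # 4 clips, 30 frames each
refined = model(noisy_track)                 # same shape, temporally smoothed
print(refined.shape)                         # torch.Size([4, 30, 136])
```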

 

 


Full paper

 

 

 

 

 

 

 

Survey on 3D face reconstruction from uncalibrated images


A. Morales, G. Piella and F.M. Sukno

Computer Science Review, 40(5): 100400, 2021.

Recently, a lot of attention has been focused on the incorporation of 3D data into face analysis and its applications. Despite providing a more accurate representation of the face, 3D facial images are more complex to acquire than 2D pictures. As a consequence, great effort has been invested in developing systems that reconstruct 3D faces from an uncalibrated 2D image. However, the 3D-from-2D face reconstruction problem is ill-posed, so prior knowledge is needed to restrict the solution space. In this work, we review 3D face reconstruction methods proposed in the last decade, focusing on those that only use 2D pictures captured under uncontrolled conditions. We present a classification of the proposed methods based on the technique used to add prior knowledge, considering three main strategies, namely statistical model fitting, photometry, and deep learning, and review each of them separately. In addition, given the relevance of statistical 3D facial models as prior knowledge, we explain the construction procedure and provide a list of the most popular publicly available 3D facial models. After this exhaustive study of 3D-from-2D face reconstruction approaches, we observe that the deep learning strategy has been growing rapidly in the last few years, becoming the standard choice in replacement of the once widespread statistical model fitting. Unlike the other two strategies, photometry-based methods have decreased in number due to the need for strong underlying assumptions that limit the quality of their reconstructions compared to statistical model fitting and deep learning methods. The review also identifies current challenges and suggests avenues for future research.

 

 


Full paper

 

 

 

3D Fetal Face Reconstruction from Ultrasound Imaging

A. Alomar, A. Morales, K. Vellve, A.R. Porras, F. Crispi, M.G. Linguraru, G. Piella and F.M. Sukno

Proc. 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Vol 1 GRAPP, pp. 615–624, 2021.

The fetal face contains essential information in the evaluation of congenital malformations and the fetal brain function, as its development is driven by genetic factors at early stages of embryogenesis. Three-dimensional ultrasound (3DUS) can provide information about the facial morphology of the fetus, but its use for prenatal diagnosis is challenging due to imaging noise, fetal movements, limited field-of-view, low soft-tissue contrast, and occlusions. In this paper, we propose a fetal face reconstruction algorithm from 3DUS images based on a novel statistical morphable model of newborn faces, the BabyFM.

We test the feasibility of using newborn statistics to accurately reconstruct fetal faces by fitting the regularized morphable model to the noisy 3DUS images. The algorithm is capable of reconstructing the whole facial morphology of babies from one or several ultrasound scans in order to handle adverse conditions (e.g. missing parts, noisy data), and it has the potential to aid in-utero diagnosis of conditions that involve facial dysmorphology.
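
The fitting step can be illustrated with a hedged sketch: given a statistical model (a mean shape plus PCA basis, standing in for the BabyFM) and noisy target vertices in correspondence with it, ridge regularization keeps the recovered coefficients plausible. All sizes and the synthetic data are assumptions.

```python
# Hedged sketch of regularized morphable-model fitting: recover shape
# coefficients from noisy target vertices in correspondence with the model.
import numpy as np

n_vertices, n_modes = 500, 10
rng = np.random.default_rng(0)
mean = rng.normal(size=3 * n_vertices)               # flattened mean shape
basis = rng.normal(size=(3 * n_vertices, n_modes))   # PCA modes (columns)

true_w = rng.normal(size=n_modes)
target = mean + basis @ true_w + rng.normal(scale=0.5, size=3 * n_vertices)

lam = 1.0                                            # regularization strength
# Closed-form ridge solution: w = (B^T B + lam I)^-1 B^T (target - mean)
w = np.linalg.solve(basis.T @ basis + lam * np.eye(n_modes),
                    basis.T @ (target - mean))
reconstruction = mean + basis @ w
print(np.linalg.norm(w - true_w))                    # small despite the noise
```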

Full paper


Presentation (video)

 

An Enhanced Adversarial Network with Combined Latent Features for Spatio-Temporal Facial Affect Estimation in the Wild

D. Aspandi, F.M. Sukno, B. Schuller and X. Binefa

Proc. 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, (Online & Streaming), Vol 4 VISAPP, pp. 172–181, 2021.

Affective Computing has recently attracted the attention of the research community due to its numerous applications in diverse areas. In this context, the emergence of video-based data makes it possible to enrich the widely used spatial features with temporal information. However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data, making training difficult and time-consuming. This paper addresses these shortcomings by proposing a novel model that efficiently extracts both spatial and temporal features from the data by means of enhanced temporal modelling based on latent features. Our proposed model consists of three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting combined with curriculum learning to enable our adaptive attention modules. In our experiments, we show the effectiveness of our approach by reporting competitive results on both the AFEW-VA and SEWA datasets, suggesting that temporal modelling improves the affect estimates in both qualitative and quantitative terms. Furthermore, we find that the inclusion of attention mechanisms leads to the highest accuracy improvements, as their weights appear to correlate well with the appearance of facial movements, both in terms of temporal localisation and intensity. Finally, we observe a sequence length of around 160 ms to be the optimum for temporal modelling, which is consistent with other relevant findings utilising similar lengths.
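
A heavily simplified sketch of adversarial affect regression (the paper's Combiner, attention modules and curriculum learning are not reproduced): a recurrent regressor maps feature sequences to valence/arousal tracks while a discriminator pushes the predicted tracks toward realistic temporal dynamics.

```python
# Simplified adversarial training loop for sequence-level affect regression.
import torch
import torch.nn as nn

G = nn.GRU(input_size=128, hidden_size=2, batch_first=True)  # features -> (valence, arousal)
D = nn.Sequential(nn.Flatten(), nn.Linear(30 * 2, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

feats = torch.randn(8, 30, 128)            # 8 clips, 30 frames of features
labels = torch.rand(8, 30, 2) * 2 - 1      # ground-truth affect in [-1, 1]

for _ in range(3):
    fake, _ = G(feats)
    # Discriminator: real affect tracks vs generated ones.
    d_loss = bce(D(labels), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: regression loss plus adversarial term.
    fake, _ = G(feats)
    g_loss = nn.functional.mse_loss(fake, labels) + 0.1 * bce(D(fake), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```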

 

 

Full paper

Source code with full definitions of our models

 

Spectral Correspondence Framework for Building a 3D Baby Face Model

A. Morales, A.R. Porras, L. Tu, M.G. Linguraru, G. Piella and F.M. Sukno

Proc. 15th IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Argentina, pp. 507–514, 2020.

Early detection of facial dysmorphology (variations from normal facial geometry) is essential for the timely identification of genetic conditions, which has a significant impact on reducing the associated mortality and morbidity. A model encoding the normal variability of the healthy population can serve as a reference to quantify the often subtle facial abnormalities present in young patients with such conditions.

In this paper, we present the first facial model constructed exclusively from newborn data, the Baby Face Model (BabyFM). Our model is built from 3D scans with an innovative pipeline based on least squares conformal maps (LSCM). LSCM are piece-wise linear mappings that project the training faces to a common 2D space while minimising the conformal distortion. This process improves the correspondences between 3D faces, which is particularly important for the identification of subtle dysmorphology. We evaluate the ability of our BabyFM to recover the baby's facial morphology from a set of 2D images by comparing it to state-of-the-art facial models. We also compare it to models built following a pipeline analogous to the one proposed in this paper, but using non-rigid iterative closest point (NICP) to establish dense correspondences between the training faces. The results show that our model reconstructs the facial morphology of babies with significantly smaller errors than both the state-of-the-art models (p < 10⁻⁴) and the “NICP models” (p < 0.01).
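
Once dense correspondences are established (via LSCM in the paper), the statistical model itself is a PCA over the registered shapes. A minimal sketch with synthetic stand-in data:

```python
# Build a statistical shape model (mean + PCA modes) from registered scans.
import numpy as np

n_scans, n_vertices = 40, 1000
rng = np.random.default_rng(1)
# Rows: registered scans flattened to (x1, y1, z1, x2, ...) vectors.
shapes = rng.normal(size=(n_scans, 3 * n_vertices))

mean_shape = shapes.mean(axis=0)
centered = shapes - mean_shape
# SVD of the centered data matrix gives the principal shape modes.
_, singular_values, modes = np.linalg.svd(centered, full_matrices=False)
variance = singular_values ** 2 / (n_scans - 1)

# Keep enough modes to explain 98% of the shape variance.
k = int(np.searchsorted(np.cumsum(variance) / variance.sum(), 0.98)) + 1
basis = modes[:k].T                       # (3 * n_vertices, k)
print(f"model: mean + {k} modes")
```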

 


 

 

Full paper

Presentation Slides

 

 

 

 

 

End-to-end facial and physiological model for Affective Computing and applications

J. Comas, D. Aspandi and X. Binefa

Proc. 15th IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Argentina, pp. 507–514, 2020.

In recent years, affective computing and its applications have become a fast-growing research topic. Furthermore, the rise of deep learning has introduced significant improvements in emotion recognition systems compared to classical methods. In this work, we propose a multi-modal emotion recognition model based on deep learning techniques using the combination of peripheral physiological signals and facial expressions. Moreover, we present an improvement to the proposed models by introducing latent features extracted from our internal Bio Auto-Encoder (BAE). Both models are trained and evaluated on the AMIGOS dataset, reporting valence, arousal, and emotion state classification. Finally, to demonstrate a possible medical application of affective computing using deep learning techniques, we applied the proposed method to the assessment of anxiety therapy. To this end, a reduced multi-modal database was collected by recording the facial expressions and peripheral signals, namely electrocardiogram (ECG) and galvanic skin response (GSR), of each patient. Valence and arousal estimates were extracted using our proposed model across the duration of the therapy, successfully tracking the different emotional changes in the temporal domain.
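
A hedged sketch of the fusion idea (layer sizes and names are assumptions, not the paper's model): an auto-encoder compresses peripheral signals into latent features that are concatenated with facial features before the valence/arousal head.

```python
# Sketch: autoencoder latent features from bio-signals fused with facial
# features for affect regression.
import torch
import torch.nn as nn

class BioAutoEncoder(nn.Module):
    def __init__(self, in_dim=64, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, in_dim))

    def forward(self, bio):
        z = self.enc(bio)                    # latent bio features
        return self.dec(z), z

bae = BioAutoEncoder()
head = nn.Linear(128 + 16, 2)                # facial + latent -> (valence, arousal)

face_feats = torch.randn(8, 128)             # per-clip facial features
bio = torch.randn(8, 64)                     # windowed ECG/GSR features
recon, z = bae(bio)
affect = head(torch.cat([face_feats, z], dim=1))
loss = nn.functional.mse_loss(recon, bio)    # AE term; add affect loss in training
print(affect.shape)                          # torch.Size([8, 2])
```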

 

 

Full paper

 

 

 

CoGANs for Unsupervised Visual Speech Adaptation to New Speakers

A. Fernandez-Lopez, A. Karaali, N. Harte and F.M. Sukno

Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6294–6298, 2020.

Audio-Visual Speech Recognition (AVSR) faces the difficult task of exploiting acoustic and visual cues simultaneously. Augmenting speech with the visual channel creates its own challenges, e.g. every person has unique mouth movements, making the generalization of visual models very difficult. This factor motivates our focus on the generalization of speaker-independent (SI) AVSR systems, especially in noisy environments, by exploiting the visual domain. Specifically, we are the first to explore the visual adaptation of an SI-AVSR system to an unknown and unlabelled speaker. We adapt an AVSR system trained in a source domain to decode samples in a target domain, without the need for labels in the target domain. For the domain adaptation to the unknown speaker, we use Coupled Generative Adversarial Networks to automatically learn a joint distribution of multi-domain images. We evaluate our character-based AVSR system on the TCD-TIMIT dataset and obtain up to a 10% average improvement with respect to its equivalent AVSR system.
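
The core CoGAN mechanism can be sketched in a few lines: the two generators share the layers that decode high-level semantics and keep domain-specific output layers, so one latent code yields corresponding images in both domains. Sizes are illustrative.

```python
# Sketch of CoGAN weight sharing: a tied trunk decodes shared semantics,
# domain-specific heads decode appearance for each speaker domain.
import torch
import torch.nn as nn

shared = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU())   # tied across domains
head_src = nn.Linear(256, 32 * 32)    # source-speaker decoder
head_tgt = nn.Linear(256, 32 * 32)    # target-speaker decoder

z = torch.randn(4, 100)               # one latent code per sample
h = shared(z)
img_src = torch.tanh(head_src(h)).view(4, 32, 32)
img_tgt = torch.tanh(head_tgt(h)).view(4, 32, 32)   # corresponding target image
```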

 

 

Full paper

Presentation (video)

 

 

 

Refining the resolution of craniofacial dysmorphology in bipolar disorder as an index of brain dysmorphogenesis

S. Katina, B.D. Kelly, M.A. Rojas, F.M. Sukno, A. McDermott, R.J. Hennessy, A. Lane, P.F. Whelan, A.W. Bowman and J.L. Waddington

Psychiatry Research, 291: 113243, 2020.

As understanding of the genetics of bipolar disorder increases, controversy endures regarding whether the origins of this illness include early maldevelopment. Clarification would be facilitated by a ‘hard’ biological index of fetal developmental abnormality, among which craniofacial dysmorphology bears the closest embryological relationship to brain dysmorphogenesis. Therefore, 3D laser surface imaging was used to capture the facial surface of 21 patients with bipolar disorder and 45 control subjects; 21 patients with schizophrenia were also studied.

Surface images were subjected to geometric morphometric analysis in non-affine space for more incisive resolution of subtle, localised dysmorphologies that might distinguish patients from controls. Complex and more biologically informative, non-linear changes distinguished bipolar patients from control subjects. On a background of minor dysmorphology of the upper face, maxilla, midface and periorbital regions, bipolar disorder was characterised primarily by the following dysmorphologies: (a) retrusion and shortening of the premaxilla, nose, philtrum, lips and mouth (the frontonasal prominences), with (b) some protrusion and widening of the mandible-chin. The topography of facial dysmorphology in bipolar disorder indicates disruption to early development in the frontonasal process and, on embryological grounds, cerebral dysmorphogenesis in the forebrain, most likely between the 10th and 15th week of fetal life.

Full paper

Radio Coverage

 

Tensor Decomposition and Non-linear Manifold Modeling for 3D Head Pose Estimation

D. Derkach, A. Ruiz and F.M. Sukno

International Journal of Computer Vision, 127(10): 1565–1585, 2019.

Head pose estimation is a challenging computer vision problem with important applications in different scenarios, such as human-computer interaction or face recognition. In this paper, we present a 3D head pose estimation algorithm based on non-linear manifold learning. A key feature of the proposed approach is that it allows modeling the underlying 3D manifold that results from the combination of rotation angles. To do so, we use tensor decomposition to generate separate subspaces for each variation factor and show that each of them has a clear structure that can be modeled with cosine functions of a unique shared parameter per angle. Such a representation provides a deep understanding of data behavior. We show that the proposed framework can be applied to a wide variety of input features and can be used for different purposes. Firstly, we test our system on a publicly available database consisting of 2D images and show that the cosine functions can be used to synthesize rotated versions of an object from which we only see a 2D image at a specific angle. Further, we perform 3D head pose estimation experiments using two other types of features: automatic landmarks and histogram-based 3D descriptors. We evaluate our approach on two publicly available databases and demonstrate that angle estimation can be performed by optimizing the combination of these cosine functions, achieving state-of-the-art performance.
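
The angle-recovery idea can be sketched as follows: if the subspace coefficients behave as c_i(θ) = a_i cos(θ + φ_i) with a shared parameter θ, the pose of a new sample is found by a one-dimensional search over θ. Amplitudes and phases below are synthetic.

```python
# Sketch: estimate a pose angle by fitting shared-parameter cosine functions.
import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0.5, 2.0, size=5)           # per-component amplitudes
phi = rng.uniform(-np.pi, np.pi, size=5)    # per-component phases

theta_true = np.deg2rad(37.0)
observed = a * np.cos(theta_true + phi) + rng.normal(scale=0.01, size=5)

thetas = np.deg2rad(np.arange(-90, 90.25, 0.25))
# Residual of the cosine model at every candidate angle.
residuals = ((observed[None, :]
              - a[None, :] * np.cos(thetas[:, None] + phi[None, :])) ** 2).sum(axis=1)
theta_hat = thetas[np.argmin(residuals)]
print(np.rad2deg(theta_hat))                # close to 37 degrees
```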

 

 

 

Tensor decomposition of multi-view data yields manifold subspaces whose components follow trigonometric curves

 

 

 

Full paper

 


Three-Dimensional Face Reconstruction from Uncalibrated Photographs: Application to Early Detection of Genetic Syndromes

L. Tu, A.R. Porras, A. Morales, D.A. Perez, G. Piella, F.M. Sukno and M.G. Linguraru

Proc. 8th MICCAI Clinical Image-based Procedures Workshop, Shenzhen, China, pp. 182–189, 2019.

Facial analysis from photography supports the early identification of genetic syndromes, but clinically-acquired uncalibrated images suffer from pose and illumination variability. Although 3D photography overcomes some of the challenges of 2D images, 3D scanners are not typically available. We present an optimization method for 3D face reconstruction from uncalibrated 2D photographs of the face, using a novel statistical shape model of the infant face. First, our method creates an initial estimate of the camera pose for each 2D photograph using the average shape of the statistical model and a set of 2D facial landmarks. Second, it calculates the camera pose and the parameters of the statistical model by minimizing the distance between the projection of the estimated 3D face in the image plane of each camera and the observed 2D face geometry. Using the reconstructed 3D faces, we automatically extract a set of 3D geometric and appearance descriptors and use them to train a classifier to identify facial dysmorphology associated with genetic syndromes. We evaluated our face reconstruction method on 3D photographs of 54 subjects (age range 0–3 years) and obtained a point-to-surface error of 2.01 ± 0.54%, a significant improvement over the 2.98 ± 0.64% obtained with state-of-the-art methods (p < 0.001). Our classifier detected genetic syndromes from the 3D faces reconstructed from the 2D photographs with 100% sensitivity and 92.11% specificity.
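
A hedged sketch of the second step: jointly optimizing a weak-perspective camera (scale, in-plane rotation, translation) and the shape coefficients so that the projected 3D landmarks match the observed 2D ones. The model, data and simplified camera are assumptions.

```python
# Sketch: fit camera parameters and shape coefficients to 2D landmarks.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
L, K = 20, 5                                # landmarks, shape modes
mean = rng.normal(size=(L, 3))
basis = rng.normal(scale=0.1, size=(K, L, 3))

def project(params):
    s, rot, tx, ty = params[0], params[1], params[2], params[3]
    w = params[4:]
    shape = mean + np.tensordot(w, basis, axes=1)       # (L, 3)
    c, si = np.cos(rot), np.sin(rot)
    R = np.array([[c, -si, 0.0], [si, c, 0.0]])         # 2x3 weak perspective
    return s * shape @ R.T + np.array([tx, ty])

target = project(np.concatenate([[1.2, 0.3, 5.0, -2.0], rng.normal(size=K)]))
x0 = np.concatenate([[1.0, 0.0, 0.0, 0.0], np.zeros(K)])
fit = least_squares(lambda p: (project(p) - target).ravel(), x0)
print(fit.cost)                             # near zero on this synthetic case
```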

 

 

Full paper

TV Coverage

 

 

 

Robust facial alignment with internal denoising auto-encoder

D. Aspandi, O. Martinez, F.M. Sukno and X. Binefa

Proc. 16th Conference on Computer & Robot Vision, Ontario, Canada, pp. 143–150, 2019.

The development of facial alignment models is growing rapidly thanks to the availability of large facial-landmark datasets and powerful deep learning models. However, important challenges remain for facial alignment models working on images under extreme conditions, such as severe occlusions or large variations in pose and illumination. Current attempts to overcome this limitation have mainly focused on building robust feature extractors, under the assumption that the model will be able to discard the noise and select only the meaningful features. However, such an assumption ignores the importance of understanding the noise that characterizes unconstrained images, which has been shown to benefit computer vision models if used appropriately in the learning strategy. Thus, in this paper we investigate the introduction of specialized modules for noise detection and removal, in combination with our state-of-the-art facial alignment module, and show that this leads to improved robustness both to synthesized noise and to in-the-wild conditions. The proposed model combines two major subnetworks: an internal image denoiser (based on the auto-encoder architecture) and a facial landmark localiser (based on the Inception-ResNet architecture). Our results on the 300-W and Menpo datasets show that our model can effectively handle different types of synthetic noise, which also leads to enhanced robustness in real-world unconstrained settings, reaching top state-of-the-art accuracy.
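
A minimal denoising auto-encoder sketch (the layer configuration is an assumption, not the paper's): the network is trained to recover the clean image from a corrupted input, and its output can then feed the landmark localiser.

```python
# Minimal denoising auto-encoder: reconstruct clean faces from noisy inputs.
import torch
import torch.nn as nn

dae = nn.Sequential(                       # encoder-decoder over 64x64 faces
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid())

clean = torch.rand(8, 3, 64, 64)
noisy = (clean + 0.2 * torch.randn_like(clean)).clamp(0, 1)
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)
for _ in range(3):                         # reconstruction objective
    loss = nn.functional.mse_loss(dae(noisy), clean)
    opt.zero_grad(); loss.backward(); opt.step()
```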

 

 

Full paper

Source code of the modified Denoising AutoEncoder

Source code of our modified Inception ResNet

 

 

 

Lip-Reading with Limited-Data Network

A. Fernandez-Lopez and F.M. Sukno

Proc. 27th European Signal Processing Conference, A Coruña, Spain, 2019.

The development of Automatic Lip-Reading (ALR) systems is currently dominated by Deep Learning (DL) approaches. However, DL systems generally face two main issues, related to the amount of data and the complexity of the model. To find a balance between the amount of available training data and the number of parameters of the model, in this work we introduce an end-to-end ALR system that combines CNNs and LSTMs and can be trained without large-scale databases. To this end, we propose to split the training into modules by automatically generating weak labels per frame, termed visual units. These weak visual units are representative enough to guide the CNN to extract meaningful features which, when combined with the context provided by the temporal module, are sufficiently informative to train an ALR system in a very short time and with no need for manual labeling. The system is evaluated on the well-known OuluVS2 database for sentence-level classification. We obtain an accuracy of 91.38%, which is comparable to state-of-the-art results but, unlike most previous approaches, does not require the use of external training data.
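
One plausible reading of the weak-labelling step, sketched below as an assumption rather than the paper's exact procedure: cluster per-frame visual features and use the cluster indices as "visual units" to supervise the CNN without manual annotation.

```python
# Sketch: derive weak per-frame "visual unit" labels by clustering features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
frame_features = rng.normal(size=(5000, 64))      # e.g. mouth-region features

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(frame_features)
weak_labels = kmeans.labels_                      # one visual unit per frame
print(np.bincount(weak_labels))                   # unit usage across frames
```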

 

 

Full paper

 

 

 

Fully end-to-end composite recurrent convolution network for deformable facial tracking in the wild

D. Aspandi, O. Martinez, F.M. Sukno and X. Binefa

Proc. 14th International Conference on Automatic Face & Gesture Recognition, Lille, France, 2019.

Human facial tracking is an important task in computer vision, which has recently lost pace compared to other facial analysis tasks. The majority of currently available trackers have two major limitations: their limited use of temporal information and their widespread use of handcrafted features, without taking full advantage of the large annotated datasets that have recently become available. In this paper we present a fully end-to-end facial tracking model based on current state-of-the-art deep architectures that can be effectively trained from the available annotated facial landmark datasets. We build our model from the recently introduced general object tracker Re3, which allows modeling the short- and long-term temporal dependency between frames by means of its internal Long Short-Term Memory (LSTM) layers. Facial tracking experiments on the challenging 300-VW dataset show that our model can produce state-of-the-art accuracy and far lower failure rates than competing approaches. We specifically compare the performance of our approach modified to work in tracking-by-detection mode and show that, as such, it can produce results comparable to state-of-the-art trackers. However, upon activation of our tracking mechanism, the results improve significantly, confirming the advantage of taking temporal dependencies into account.

Full paper

Pre-trained models and results

 

Heatmap-guided balanced deep convolution networks for family classification in the wild

D. Aspandi, O. Martinez and X. Binefa

Proc. 14th International Conference on Automatic Face & Gesture Recognition, Lille, France, 2019.

Automatic kinship recognition using computer vision, which aims to infer blood relationships between individuals by comparing only their facial features, has recently started to gain attention. The introduction of large kinship datasets, such as Families In The Wild (FIW), has enabled large-scale modeling using state-of-the-art deep learning models. Among kinship recognition tasks, family classification has seen little progress, as its difficulty grows with the number of family members. Furthermore, most current state-of-the-art approaches do not perform any data pre-processing (which could improve model accuracy) and are trained without a regularizer (which leaves the models susceptible to overfitting). In this paper, we present the Deep Family Classifier (DFC), a deep learning model for family classification in the wild. We build our model by combining two sub-networks: an internal Image Feature Enhancer, which removes image noise and provides an additional facial heatmap layer, and a Family Class Estimator, trained with strong regularizers and a compound loss. We observe progressive improvements in accuracy during the validation phase, with state-of-the-art results of 16.89% for track 2 of the RFIW2019 challenge and 17.08% for the family classification task on the FIW dataset.

Full paper

Pre-trained models

 

Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish


A. Fernandez-Lopez and F.M. Sukno

Computer Vision, Imaging and Computer Graphics - Theory and Application, Communications in Computer and Information Science book series, Vol 983, pp. 305–328, 2019

Speech is the most used communication method between humans and is considered a multisensory process. Despite the popular belief that speech is something that we hear, there is overwhelming evidence that the brain treats speech as something that we hear and see. Much of the research has focused on Automatic Speech Recognition (ASR) systems, treating speech primarily as an acoustic form of communication. In recent years, there has been an increasing interest in systems for Automatic Lip-Reading (ALR), although exploiting the visual information has proved challenging. One of the main problems in ALR is how to make the system robust to the visual ambiguities that appear at the word level. These ambiguities make the definition of the minimum distinguishable unit of the video domain confusing and imprecise. In contrast to the audio domain, where the phoneme is the standard minimum auditory unit, there is no consensus on the definition of the minimum visual unit (the viseme). In this work, we focus on the automatic construction of a phoneme-to-viseme mapping based on visual similarities between phonemes, so as to maximize word recognition. We investigate the usefulness of different phoneme-to-viseme mappings, obtaining the best results for intermediate vocabulary lengths. We construct an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region, and HMMs to model the statistical relations of both viseme and phoneme sequences. We test our system on two Spanish corpora with continuous speech (AV@CAR and VLRF), containing 19 and 24 speakers, respectively. Our results indicate that we are able to recognize 47% (resp. 51%) of the phonemes and 23% (resp. 21%) of the words for AV@CAR (resp. VLRF). We also show additional results that support the usefulness of visemes. Experiments on a comparable ALR system trained exclusively using phonemes at all its stages confirm the existence of strong visual ambiguities between groups of phonemes. This fact, and the higher word accuracy obtained when using phoneme-to-viseme mappings, justifies the usefulness of visemes instead of the direct use of phonemes for ALR.
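
One common way to derive such a mapping, sketched here as an illustration rather than the paper's exact algorithm: hierarchically cluster phonemes by the similarity of their visual confusions and cut the dendrogram at the desired number of visemes.

```python
# Sketch: phoneme-to-viseme mapping by clustering a phoneme confusion matrix.
# The matrix here is random; in practice it comes from a phoneme recognizer.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
n_phonemes = 24
confusion = rng.random((n_phonemes, n_phonemes))
confusion = (confusion + confusion.T) / 2          # symmetric similarity

# Phonemes confused with the same phonemes are visually similar.
dist = pdist(confusion)                            # row-wise distances
tree = linkage(dist, method="average")
visemes = fcluster(tree, t=10, criterion="maxclust")  # 10 visemes
for p, v in enumerate(visemes):
    print(f"phoneme {p:2d} -> viseme {v}")
```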

 

 


Full paper

 

 

 

 

 

3D head pose estimation using tensor decomposition and non-linear manifold modeling

D. Derkach, A. Ruiz and F.M. Sukno

Proc. International Conference on 3D Vision, Verona, Italy, pp 505–513, 2018.

Head pose estimation is a challenging computer vision problem with important applications in different scenarios, such as human-computer interaction or face recognition. In this paper, we present an algorithm for 3D head pose estimation using only depth information from Kinect sensors. A key feature of the proposed approach is that it allows modeling the underlying 3D manifold that results from the combination of pitch, yaw and roll variations. To do so, we use tensor decomposition to generate separate subspaces for each variation factor and show that each of them has a clear structure that can be modeled with cosine functions of a unique shared parameter per angle. Such a representation provides a deep understanding of data behavior, and angle estimation can be performed by optimizing the combination of these cosine functions. We evaluate our approach on two publicly available databases, and achieve top state-of-the-art performance.

 

 


Full paper

 

 

 

Multi-instance dynamic ordinal random fields for weakly supervised facial behavior analysis

A. Ruiz, O. Rudovic, X. Binefa and M. Pantic

IEEE Transactions on Image Processing, 27(8): 3969–3982, 2018.

 

We propose a multi-instance learning (MIL) approach for weakly supervised learning problems, where a training set is formed by bags (sets of feature vectors, or instances) and only labels at bag level are provided. Specifically, we consider the multi-instance dynamic ordinal regression (MI-DOR) setting, where the instance labels are naturally represented as ordinal variables and bags are structured as temporal sequences. To this end, we propose Multi-Instance Dynamic Ordinal Random Fields (MI-DORF). In this framework, we treat instance labels as temporally dependent latent variables in an undirected graphical model. Different MIL assumptions are modelled via newly introduced high-order potentials relating bag and instance labels within the energy function of the model. We also extend our framework to address the partially observed MI-DOR problem, where a subset of instance labels is available during training. We show, on the tasks of weakly supervised facial action unit and pain intensity estimation, that the proposed framework outperforms alternative learning approaches. Furthermore, we show that MI-DORF can be employed to largely reduce the data annotation effort in this context.
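
The sketch below is not MI-DORF itself, but illustrates the basic MIL assumption it generalizes: instance-level predictions within a bag are aggregated (here by max, as is natural for peak intensities) and only the bag-level label drives the gradients.

```python
# Sketch of the MIL max assumption: bag label supervises instance predictions.
import torch
import torch.nn as nn

instance_net = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))
bag = torch.randn(40, 32)                  # one bag: 40 frames of features
bag_label = torch.tensor([3.0])            # e.g. peak pain intensity

instance_scores = instance_net(bag).squeeze(1)      # latent per-frame scores
bag_prediction = instance_scores.max()              # MIL max assumption
loss = nn.functional.mse_loss(bag_prediction, bag_label.squeeze())
loss.backward()                            # gradients flow to the instance net
```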

Full paper

 

Automatic local shape spectrum analysis for 3D facial expression recognition

D. Derkach and F.M. Sukno

Image and Vision Computing, 79: 86–98, 2018.

We investigate the problem of Facial Expression Recognition (FER) using 3D data. Building on one of the most successful frameworks for facial analysis using exclusively 3D geometry, we extend the analysis from a curve-based representation to a spectral representation, which allows a complete description of the underlying surface that can be further tuned to the desired level of detail. Spectral representations are based on the decomposition of the geometry in its spatial frequency components, much like a Fourier transform, which are related to intrinsic characteristics of the surface. In this work, we propose the use of Graph Laplacian Features (GLFs), which result from the projection of local surface patches onto a common basis obtained from the Graph Laplacian eigenspace. We extract patches around facial landmarks and include a state-of-the-art localization algorithm to allow for fully-automatic operation. The proposed approach is tested on the three most popular databases for 3D FER (BU-3DFE, Bosphorus and BU-4DFE) in terms of expression and AU recognition. Our results show that the proposed GLFs consistently outperform the curve-based approach as well as the most popular alternative for spectral representation, Shape-DNA, which is based on the Laplace-Beltrami Operator and cannot provide a stable basis that guarantees that the extracted signatures for the different patches are directly comparable. Interestingly, the accuracy improvement brought by GLFs also comes at a lower computational cost. Considering the extraction of patches as a step common to the three compared approaches, the curve-based framework requires a costly elastic deformation between corresponding curves (e.g. based on splines) and Shape-DNA requires computing an eigen-decomposition of every new patch to be analyzed. In contrast, GLFs only require the projection of the patch geometry onto the Graph Laplacian eigenspace, which is common to all patches and can therefore be pre-computed off-line. We also show that 14 automatically detected landmarks are enough to achieve high FER and AU detection rates, only slightly below those obtained with sets of manually annotated landmarks.
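
A toy sketch of the GLF construction: eigenvectors of a graph Laplacian shared by all patches form a common spectral basis, and projecting each patch's geometry onto it yields directly comparable signatures. The ring topology below is a stand-in for the real patch connectivity.

```python
# Sketch: spectral patch signature from a shared graph Laplacian eigenbasis.
import numpy as np

n = 50                                      # vertices in the common patch graph
A = np.zeros((n, n))
for i in range(n):                          # simple ring topology
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
laplacian = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(laplacian)
basis = eigvecs[:, 1:11]                    # first 10 non-trivial modes

patch_z = np.random.default_rng(6).normal(size=n)   # per-vertex depth values
glf = basis.T @ patch_z                     # 10-D spectral signature
print(glf.shape)
```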

Full paper

Manual AU annotations for high-intensity expressions in the BU-3DFE database

 

 

Survey on Automatic Lip-Reading in the Era of Deep Learning

A. Fernandez-Lopez and F.M. Sukno

Image and Vision Computing, 78: 53–72, 2018.

In the last few years, there has been an increasing interest in developing systems for Automatic Lip-Reading (ALR). Similarly to other computer vision applications, methods based on Deep Learning (DL) have become very popular and have substantially pushed forward the achievable performance. In this survey, we review ALR research over the last decade, highlighting the progression from approaches that preceded DL (which we refer to as traditional) toward end-to-end DL architectures. We provide a comprehensive list of the audio-visual databases available for lip-reading, describing what tasks they can be used for, their popularity and their most important characteristics, such as the number of speakers, vocabulary size, recording settings and total duration. In correspondence with the shift toward DL, we show that there is a clear tendency toward large-scale datasets targeting realistic application settings and large numbers of samples per class. On the other hand, we summarize, discuss and compare the different ALR systems proposed in the last decade, considering traditional and DL approaches separately. We address a quantitative analysis of the different systems by organizing them in terms of the task that they target (e.g. recognition of letters or digits, words or sentences) and comparing their reported performance on the most commonly used datasets. As a result, we find that DL architectures perform similarly to traditional ones for simpler tasks but report significant improvements in more complex tasks, such as word or sentence recognition, with up to 40% improvement in word recognition rates. Hence, we provide a detailed description of the available ALR systems based on end-to-end DL architectures and identify a tendency to focus on the modeling of temporal context as the key to advancing the field. Such modeling is dominated by recurrent neural networks due to their ability to retain context at multiple scales (e.g. short- and long-term information). In this sense, current efforts tend toward techniques that allow a more comprehensive modeling and interpretability of the retained context.

 

 

Full paper


Supplementary Material

 

 

 

A quantitative comparison of methods for 3D face reconstruction from 2D images

A. Morales, G. Piella, O. Martinez and F.M. Sukno

Proc. IEEE International Conference on Automatic Face & Gesture Recognition, Xi’an, China, 2018.

In the past years, many studies have highlighted the relation between deviations from normal facial morphology (dysmorphology) and certain genetic and mental disorders. Recent advances in methods for reconstructing the 3D geometry of the face from 2D images open new possibilities for dysmorphology research without the need for specialized 3D imaging equipment. However, it is unclear whether these methods can reconstruct the facial geometry with the required accuracy.

In this paper we present a comparative study of some of the most relevant approaches for 3D face reconstruction from 2D images, including photometric stereo, deep learning and 3D Morphable Model fitting. We address the comparison in qualitative and quantitative terms using a public database consisting of 2D images and 3D scans of 100 people. Interestingly, we find that some methods produce quite noisy reconstructions that do not seem realistic, whereas others look more natural. However, the latter do not seem to adequately capture the geometric variability that exists between different subjects and produce reconstructions that always look very similar across individuals, thus questioning their fidelity.

Full paper

 

 

Further activities related to the project

The research activities in this project have contributed directly or indirectly to the development of undergraduate and graduate students, as briefly summarized below:

 

María Araceli Morales, Advanced tools for facial analysis: application to newborns

PhD Thesis, Universitat Pompeu Fabra. Departament de Tecnologies de la Informació i les Comunicacions (in progress, 2018–2022). Supervisors: F. Sukno and G. Piella

 

Decky Aspandi Latif, Deep Spatio-Temporal Neural Network for Facial Analysis

PhD Thesis, Universitat Pompeu Fabra. Departament de Tecnologies de la Informació i les Comunicacions (05-03-2021). Supervisor: X. Binefa

 

Adriana Fernandez-López, Learning of Meaningful Visual Representations for Continuous Lip-Reading

PhD Thesis, Universitat Pompeu Fabra. Departament de Tecnologies de la Informació i les Comunicacions (04-03-2021). Supervisor: F. Sukno

 

Defense recording

Dmytro Derkach, Spectrum analysis methods for 3D facial expression recognition and head pose estimation

PhD Thesis, Universitat Pompeu Fabra. Departament de Tecnologies de la Informació i les Comunicacions (03-12-2018). Supervisor: F. Sukno

 

Nuria Rodriguez Díaz. Lie Detection based on Deep Learning applied to a collected and annotated Dataset

Bachelor Thesis, Double degree in Audiovisual Systems and Computer Engineering, Universitat Pompeu Fabra (2020). Supervisor: X. Binefa.

 

Antonia Alomar Adrover. 3D fetal face reconstruction from ultrasound imaging

Bachelor Thesis, Biomedical Engineering, Universitat Pompeu Fabra (2020). Supervisors: F. Sukno, G. Piella, A. Morales.

 

 

Mar Ferrer Ferrer. Multimodal fusion of video signals for remote evaluation of emotional/cognitive processing

Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2020). Supervisors: F. Sukno, A. Pereda.

 

Lie Jin Wang. Emotion control for e-learning

Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2020). Supervisor: X. Binefa.

 

Nadia Cosor Oltra. Anomaly detection from a ranking with unsupervised machine learning methods

Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2020). Supervisor: X. Binefa.

 


Joaquim Comas Martinez. Multimodal emotion recognition based on facial and physiological signals and its application in affective computing

Master Thesis, Joint Master in Computer Vision, Universitat Autònoma de Barcelona (2019). Supervisor: X. Binefa.

 

Gary Stefano Ulloa Rodríguez. Deep affective computing: automatic recognition of human emotion using facial features

Bachelor Thesis, Computer Engineering, Universitat Pompeu Fabra (2019). Supervisors: O. Martinez and X. Binefa.

 

Gemma Alaix i Granell. Multimodal analysis for automatic affect recognition

Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2019). Supervisor: X. Binefa.

 

Sergi Solà Casas. Attentional Mechanism for Affective Computing

Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2019). Supervisor: X. Binefa.

 

Guillem Garcia Gómez. Heart rate variability measuring from facial videos to detect stress

Bachelor Thesis, Biomedical Engineering, Universitat Pompeu Fabra (2018). Supervisor: F. Sukno.

 

Joaquim Comas Martinez. Study and analysis of cardiac signals and their relation to emotions

Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2018). Supervisor: X. Binefa.

 

Paula Catalan Rabaneda. Lip-Reading Visual Passwords for User Authentication

Bachelor Thesis, Audiovisual Systems Engineering, Universitat Pompeu Fabra (2018). Supervisors: F. Sukno and A. Fernandez.

 

Acknowledgements

The Principal Investigators, Prof. Xavier Binefa & Dr. Federico Sukno, would like to thank all those involved in the activities listed above, as well as the Ministry of Economy, Industry and Competitiveness, which funded this project through grant TIN2017-90124-P.