Research


Research Highlights

·         Head Pose Estimation Based on 3-D Facial Landmarks Localization and Regression, FG 2017 – Winner of the Head Pose Estimation Challenge.

·         Local Shape Spectrum Analysis for 3D Facial Expression Recognition, FG 2017.

·         Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database, FG 2017.

·         Fusion of Valence and Arousal Annotations through Dynamic Subjective Ordinal Modelling, FG 2017.

·         Automatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading, VISAPP 2017.

·         SRILF 3D Face Landmarker – Free research software, v1.0, 2016.

·         A multimodal annotation schema for non-verbal affective analysis in the health-care domain, MARMI 2016.

·         On the Quantitative Analysis of Craniofacial Asymmetry in 3D, FG 2015.

·         3D Facial Landmark Localization with Asymmetry Patterns and Shape Regression from Incomplete Local Features, IEEE Transactions on Cybernetics, 45(9): 1717–1730, 2015.

·         Asymmetry Patterns Shape Contexts to Describe the 3D Geometry of Craniofacial Landmarks, Communications in Computer and Information Science, 458: 19–35, 2014.

·         Live 3D facial scanning and landmark detection using Shape Regression with Incomplete Local Features, FG 2013.

·         Compensating inaccurate annotations to train 3D facial landmark localization models, Proc. Workshop on 3D Face Biometrics 2013.

·         Rotationally Invariant 3D Shape Contexts Using Asymmetry Patterns, Proc. GRAPP 2013 – Best Paper Award.

·         3D Facial Landmark Localization using Combinatorial Search and Shape Regression, Proc. NORDIA 2012.

Head Pose Estimation Based on 3-D Facial Landmarks Localization and Regression

WINNER OF FG17 HEAD POSE CHALLENGE

D. Derkach, A. Ruiz and F.M. Sukno

FG 2017 Workshop on Dominant and Complementary Emotion Recognition Using Micro Emotion Features and Head-Pose Estimation, Washington, DC, USA, 2017.

In this paper we present a system that is able to estimate head pose using only depth information from consumer RGB-D cameras such as Kinect 2. In contrast to most approaches addressing this problem, we do not rely on tracking and produce pose estimation in terms of pitch, yaw and roll angles using single depth frames as input. Our system combines three different methods for pose estimation: two of them are based on state-of-the-art landmark detection and the third one is a dictionary-based approach that is able to work in especially challenging scans where landmarks or mesh correspondences are too difficult to obtain.

We evaluated our system on the SASE database, which consists of ~30K frames from 50 subjects. We obtained average pose estimation errors between 5 and 8 degrees per angle, achieving the best performance in the FG2017 Head Pose Estimation Challenge. Full code of the developed system is available on-line.
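The landmark-based route to pose estimation can be illustrated with a minimal sketch (an illustration of the principle only, not the system described above): given detected 3D landmarks and a frontal reference template with the same landmark ordering, the head rotation follows from the Kabsch algorithm and can be decomposed into pitch, yaw and roll angles. The Euler-angle convention below is an assumption.

```python
import numpy as np

def head_pose_from_landmarks(landmarks, template):
    """Estimate (pitch, yaw, roll) in degrees of a set of 3D facial
    landmarks relative to a frontal template with the same landmark
    ordering (both Nx3 arrays), using the Kabsch algorithm."""
    P = template - template.mean(axis=0)    # centered template
    Q = landmarks - landmarks.mean(axis=0)  # centered observation
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # rotation: template -> landmarks
    # Decompose R = Rz(roll) @ Ry(yaw) @ Rx(pitch)  (one common convention)
    yaw = np.degrees(np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)))
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return pitch, yaw, roll
```

A dictionary-based branch, as in the system above, would take over when landmarks cannot be reliably detected in the first place.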

Full paper

Matlab code for head pose estimation

3D face landmarking software

 

Local Shape Spectrum Analysis for 3D Facial Expression Recognition

D. Derkach and F.M. Sukno

12th IEEE International Conference on Face and Gesture Recognition, Washington, DC, USA, 2017.

We investigate the problem of facial expression recognition using 3D data. Building from one of the most successful frameworks for facial analysis using exclusively 3D geometry, we extend the analysis from a curve-based representation into a spectral representation, which allows a complete description of the underlying surface that can be further tuned to the desired level of detail. Spectral representations are based on the decomposition of the geometry in its spatial frequency components, much like a Fourier transform, which are related to intrinsic characteristics of the surface. In this work, we propose the use of Graph Laplacian Features (GLF), which results from the projection of local surface patches into a common basis obtained from the Graph Laplacian eigenspace.

We test the proposed approach on the BU-3DFE database in terms of expression and Action Unit recognition. Our results confirm that the proposed GLF produces consistently higher recognition rates than the curve-based approach, thanks to a more complete description of the surface, while requiring lower computational complexity. We also show that the GLF outperforms the most popular alternative approach for spectral representation, Shape-DNA, which is based on the Laplace-Beltrami operator and cannot provide a stable basis that guarantees that the extracted signatures for the different patches are directly comparable.
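The key property of the GLF representation, a common basis shared by all patches, can be sketched as follows (a toy illustration, assuming each surface patch is resampled onto the same regular grid so that all patches share one graph; this is not the paper's implementation):

```python
import numpy as np

def grid_laplacian(n):
    """Combinatorial Laplacian of an n x n 4-connected grid graph."""
    N = n * n
    L = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            u = i * n + j
            for di, dj in ((1, 0), (0, 1)):   # right and down neighbors
                ii, jj = i + di, j + dj
                if ii < n and jj < n:
                    v = ii * n + jj
                    L[u, u] += 1; L[v, v] += 1
                    L[u, v] -= 1; L[v, u] -= 1
    return L

def glf_signature(patch_heights, k=10):
    """Project a patch (n x n height values) onto the first k Laplacian
    eigenvectors, yielding a k-dimensional spectral signature."""
    n = patch_heights.shape[0]
    w, V = np.linalg.eigh(grid_laplacian(n))   # eigenvalues ascending
    return V[:, :k].T @ patch_heights.ravel()
```

Because every patch is projected onto the same eigenvectors, two signatures can be compared coefficient by coefficient, which an unstable per-shape basis cannot guarantee.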

Full paper

Manual AU annotations for high-intensity expressions in the BU-3DFE database

 

Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database

A. Fernandez-Lopez, O. Martinez and F.M. Sukno

12th IEEE International Conference on Face and Gesture Recognition, Washington, DC, USA, 2017.

Speech is the most used communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, although video can provide information that is complementary to the audio. Exploiting the visual information, however, has proven challenging. On the one hand, researchers have reported that the mapping between phonemes and visemes (visual units) is one-to-many, because some phonemes are visually similar and indistinguishable from one another. On the other hand, it is known that some people, e.g. deaf people, are very good lip-readers. We study the limit of visual-only speech recognition in controlled conditions. With this goal, we designed a new database in which the speakers are aware of being lip-read and aim to facilitate lip-reading.

In the literature, there are discrepancies on whether hearing-impaired people are better lip-readers than normal-hearing people. We therefore analyze whether there are differences between the lip-reading abilities of 9 hearing-impaired and 15 normal-hearing people. Finally, human abilities are compared with the performance of a visual automatic speech recognition system. In our tests, hearing-impaired participants outperformed the normal-hearing participants, but without reaching statistical significance. Human observers were able to decode 44% of the spoken message. In contrast, the visual-only automatic system achieved a 20% word recognition rate. However, if we repeat the comparison in terms of phonemes, both obtained very similar recognition rates, just above 50%. This suggests that the gap between human lip-reading and automatic speech-reading might be more related to the use of context than to the ability to interpret mouth appearance.

Full paper

List of 500 phonetically balanced sentences used in the database

 

Fusion of Valence and Arousal Annotations through Dynamic Subjective Ordinal Modelling

A. Ruiz, O. Martinez, X. Binefa and F.M. Sukno

12th IEEE International Conference on Face and Gesture Recognition, Washington, DC, USA, 2017.

An essential issue when training and validating computer vision systems for affect analysis is how to obtain reliable ground-truth labels from a pool of subjective annotations. In this paper, we address this problem when labels are given in an ordinal scale and annotated items are structured as temporal sequences. This problem is of special importance in affective computing, where collected data is typically formed by videos of human interactions annotated according to the Valence and Arousal (V-A) dimensions. Moreover, recent works have shown that inter-observer agreement of V-A annotations can be considerably improved if these are given in a discrete ordinal scale. In this context, we propose a novel framework which explicitly introduces ordinal constraints to model the subjective perception of annotators. We also incorporate dynamic information to take into account temporal correlations between ground-truth labels. In our experiments over synthetic and real data with V-A annotations, we show that the proposed method outperforms alternative approaches which do not take into account either the ordinal structure of labels or their temporal correlation.

Full paper

 

 

Automatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading

A. Fernandez-Lopez and F.M. Sukno

12th International Conference on Computer Vision Theory and Applications, Porto, Portugal, 2017.

Speech is the most common communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, but it has been demonstrated that video can provide information that is complementary to the audio. Thus, the study of automatic lip-reading is important and is still an open problem. One of the key challenges is the definition of the visual elementary units (the visemes) and their vocabulary. Many researchers have analyzed the importance of the phoneme-to-viseme mapping and have proposed viseme vocabularies with lengths between 11 and 15 visemes. These viseme vocabularies have usually been manually defined by their linguistic properties, and in some cases using decision trees or clustering techniques. In this work, we focus on the automatic construction of an optimal viseme vocabulary based on the association of phonemes with similar appearance. To this end, we construct an automatic system that uses local appearance descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences. To compare the performance of the system, different descriptors (PCA, DCT and SIFT) are analyzed. We test our system on a Spanish corpus of continuous speech. Our results indicate that we are able to recognize approximately 58% of the visemes, 47% of the phonemes and 23% of the words in a continuous speech scenario, and that the optimal viseme vocabulary for Spanish is composed of 20 visemes.
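The idea of grouping phonemes with similar visual appearance into a viseme vocabulary can be sketched with a toy greedy agglomerative clustering (an illustration only; the phoneme features below are hypothetical, and the paper's construction relies on HMMs rather than this procedure):

```python
import numpy as np

def build_viseme_vocabulary(features, n_visemes):
    """Greedily merge phonemes with similar appearance into visemes.

    features : dict mapping phoneme -> mean appearance vector
               (e.g. averaged DCT coefficients of the mouth region).
    Returns a list of phoneme groups (the viseme classes).
    """
    clusters = [[p] for p in features]
    means = [np.asarray(features[p], float) for p in features]
    while len(clusters) > n_visemes:
        # find the closest pair of cluster means
        best, bi, bj = np.inf, 0, 1
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                d = np.linalg.norm(means[i] - means[j])
                if d < best:
                    best, bi, bj = d, i, j
        # merge cluster bj into bi, updating the weighted mean
        ni, nj = len(clusters[bi]), len(clusters[bj])
        means[bi] = (ni * means[bi] + nj * means[bj]) / (ni + nj)
        clusters[bi].extend(clusters[bj])
        del clusters[bj], means[bj]
    return clusters
```

With toy features where the bilabials /p/, /b/ and /m/ lie close together in appearance space, they collapse into a single viseme class, mirroring the visually confusable groups that motivate viseme vocabularies.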

Full paper

 

A multimodal annotation schema for non-verbal affective analysis in the health-care domain

F.M. Sukno, M. Domínguez, A. Ruiz, D. Schiller, F. Lingenfelser, L. Pragst, E. Kamateri and S. Vrochidis

1st International Workshop on Multimedia Analysis and Retrieval for Multimodal Interaction, New York, USA, pp. 9–14, 2016.

The development of conversational agents with human interaction capabilities requires advanced affective state recognition integrating non-verbal cues from the different modalities constituting what in human communication we perceive as an overall affective state. Each of the modalities is often handled by a different subsystem that conveys only a partial interpretation of the whole and, as such, is evaluated only in terms of its partial view. To tackle this shortcoming, we investigate the generation of a unified multimodal annotation schema of non-verbal cues from the perspective of an inter-disciplinary group of experts. We aim at obtaining a common ground-truth with a unique representation using the Valence and Arousal space and a discrete non-linear scale of values. The proposed annotation schema is demonstrated on a corpus in the health-care domain but is scalable to other purposes. Preliminary results on inter-rater variability show a positive correlation of consensus level with high (absolute) values of Valence and Arousal as well as with the number of annotators labeling a given video sequence.

Full paper

 

On the Quantitative Analysis of Craniofacial Asymmetry in 3D

F.M. Sukno, M.A. Rojas, J.L. Waddington and P.F. Whelan

11th IEEE International Conference on Face and Gesture Recognition, Ljubljana, Slovenia, 2015.

We address a systematic evaluation of facial asymmetry from a population of 100 high-quality laser scans, which are first symmetrized and then manipulated to introduce 25 synthetic patterns with a variety of asymmetries. A quantitative evaluation is performed by comparing these known asymmetries with those estimated by different automatic algorithms. Estimation of the actual asymmetries present in the original surface was also addressed.

We find that widely used methods based on least-squares minimization not only fail to produce accurate estimates but, in some cases, recover asymmetry patterns that are radically different from the actual asymmetry of the input surfaces, with low or even negative correlation coefficients. A number of alternative algorithms are tested, including landmark-, midline- and surface-based approaches. Among these, we find that the best performance is obtained by a hybrid approach combining surface and midline points, framed within a least median of squares algorithm with weights that decay exponentially with the distance from the midline and an additional term to ensure that the recovered pattern of asymmetry is itself symmetric.
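The core reflection step behind such asymmetry estimates can be sketched as follows (a simplified illustration, not the least-median-of-squares estimator described above; the landmark pairing and the decay constant are assumptions):

```python
import numpy as np

def asymmetry_pattern(points, pairs, midline_decay=20.0):
    """Per-landmark asymmetry of a 3D point set with known
    left-right correspondences.

    points : (N, 3) landmark coordinates, x = lateral axis, x = 0 midline.
    pairs  : list of (left_idx, right_idx) mirror correspondences.
    Returns, for each pair, the displacement between a point and the
    mirror image of its partner, weighted to emphasise the midline.
    """
    M = np.diag([-1.0, 1.0, 1.0])           # reflection across x = 0
    out = []
    for l, r in pairs:
        residual = points[l] - points[r] @ M
        # exponential down-weighting with distance from the midline,
        # loosely following the weighting idea described above
        w = np.exp(-abs(points[l][0]) / midline_decay)
        out.append(w * residual)
    return np.array(out)
```

A perfectly symmetric configuration yields an all-zero pattern; any deviation shows up as a localized, midline-weighted residual.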

Full paper

Supplementary materials


Presentation at FG15 (video)

 

3D Facial Landmark Localization with Asymmetry Patterns and Shape Regression from Incomplete Local Features

F.M. Sukno, J.L. Waddington and P.F. Whelan

IEEE Transactions on Cybernetics, 45(9): 1717–1730, 2015.

 

We present a method for the automatic localization of facial landmarks that integrates non-rigid deformation with the ability to handle missing points. The algorithm generates sets of candidate locations from feature detectors and performs combinatorial search constrained by a flexible shape model. A key assumption of our approach is that for some landmarks there might not be an accurate candidate in the input set. This is tackled by detecting partial subsets of landmarks and inferring those that are missing, so that the probability of the flexible model is maximized. The ability of the model to work with incomplete information makes it possible to limit the number of candidates that need to be retained, drastically reducing the number of combinations to be tested with respect to the alternative of trying to always detect the complete set of landmarks.

We demonstrate the accuracy of the proposed method in the Face Recognition Grand Challenge (FRGC) database, where we obtain average errors of approximately 3.5 mm when targeting 14 prominent facial landmarks. For the majority of these our method produces the most accurate results reported to date in this database. Handling of occlusions and surfaces with missing parts is demonstrated with tests on the Bosphorus database, where we achieve an overall error of 4.81 mm and 4.25 mm for data with and without occlusions, respectively. To investigate potential limits in the accuracy that could be reached, we also report experiments on a database of 144 facial scans acquired in the context of clinical research, with manual annotations performed by experts, where we obtain an overall error of 2.3 mm, with averages per landmark below 3.4 mm for all 14 targeted points and within 2 mm for half of them. The coordinates of automatically located landmarks are made available on-line.
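The inference of missing landmarks so that the probability of the flexible model is maximized can be illustrated under a Gaussian point-distribution model, where the most plausible completion is the conditional mean given the observed coordinates (a sketch of the principle, not the paper's exact implementation):

```python
import numpy as np

def infer_missing(mean, cov, observed, obs_idx):
    """Most plausible values of missing coordinates under a Gaussian
    shape model N(mean, cov): the conditional mean given the observed
    coordinates (obs_idx indexes the flattened shape vector)."""
    n = len(mean)
    mis_idx = [i for i in range(n) if i not in obs_idx]
    Soo = cov[np.ix_(obs_idx, obs_idx)]   # observed-observed block
    Smo = cov[np.ix_(mis_idx, obs_idx)]   # missing-observed block
    cond = mean[mis_idx] + Smo @ np.linalg.solve(Soo, observed - mean[obs_idx])
    full = np.array(mean, float)
    full[obs_idx] = observed
    full[mis_idx] = cond
    return full
```

Observing one coordinate of a strongly correlated pair pulls the missing one toward its regression prediction, which is what lets the search retain only partial candidate sets.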

Full paper

Supplementary materials


 

Video examples of SIFT for landmark localization on the FRGC database

Testing occlusions on the Bosphorus Database

 

 

Live 3D facial scanning and landmark detection using Shape Regression with Incomplete Local Features

F.M. Sukno, J.L. Waddington and P.F. Whelan

10th IEEE International Conference on Face and Gesture Recognition, Shanghai, China, 2013.

 

This demo is focused on the automatic detection of facial landmarks in surfaces obtained from a hand held laser scanner. The objective is to demonstrate the effectiveness of the algorithm by detecting the landmarks on the facial surface of any person that volunteers to be scanned.

A hand-held laser scanner allows acquisition of a 3D surface by gathering measurements made by sweeping the scanning wand over an object (in a manner similar to spray painting). The final surface is obtained by merging the different sweeps, which can be taken from various viewpoints, allowing a complete reconstruction of the facial surface irrespective of head pose and possible self-occlusions. For this demo we use a Cobra Wand 298 (Polhemus FastSCAN™, Colchester, VT, USA). Reconstruction from multiple viewpoints, together with portability and price, is an important advantage with respect to single-view scanners.

Landmark localization is accomplished using SRILF (Shape Regression with Incomplete Local Features). This algorithm computes a set of candidate points for each landmark and performs combinatorial search, under the key assumption that some landmarks might be missed (i.e. no candidates detected); this is tackled by using partial subsets of landmarks and inferring those that are missing by maximizing their plausibility under a statistical shape model. This assumption is crucial for the generalizability of the model to live scanning scenarios, where only minimal pre-processing is possible and the quality of the resulting surfaces can vary considerably.

Full demo description

 

Video 1

Scanning and landmark detection

Video 2

Zoom into landmark localization only

 

 

Compensating inaccurate annotations to train 3D facial landmark localization models

F.M. Sukno, J.L. Waddington and P.F. Whelan

Proc. FG Workshop on 3D Face Biometrics, Shanghai, China, pp 1-8, 2013.

 

In this paper we investigate the impact of inconsistency in manual annotations when they are used to train automatic models for 3D facial landmark localization. We start by showing that it is possible to objectively measure the consistency of annotations in a database, provided that it contains replicates (i.e. repeated scans from the same person). Applying such a measure to the widely used FRGC database, we find that the manual annotations currently available are suboptimal and can strongly impair the accuracy of automatic models learnt from them.

To address this issue, we present a simple algorithm to automatically correct a set of annotations and show that it can help to significantly improve the accuracy of the models in terms of landmark localization errors. This improvement is observed even when errors are measured with respect to the original (not corrected) annotations. However, we also show that if errors are computed against an alternative set of manual annotations with higher consistency, the accuracy of the models constructed using the corrections from the presented algorithm tends to converge to the one achieved by building the models on the alternative, more consistent set.
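The replicate-based consistency idea lends itself to a simple sketch (an illustration, not necessarily the paper's exact formulation): once the replicate scans of a person are rigidly aligned, the spread of each landmark's annotated position around its consensus quantifies how consistent the annotations are.

```python
import numpy as np

def annotation_spread(replicates):
    """Per-landmark spread of manual annotations across replicate scans
    of the same person (assumed already rigidly aligned).

    replicates : (R, N, 3) array, R replicate scans, N landmarks.
    Returns the per-landmark RMS distance to the mean annotated position.
    """
    reps = np.asarray(replicates, float)
    mean = reps.mean(axis=0)                  # (N, 3) consensus positions
    d2 = ((reps - mean) ** 2).sum(axis=2)     # (R, N) squared distances
    return np.sqrt(d2.mean(axis=0))           # (N,) RMS spread
```

Landmarks with a large spread are the natural candidates for automatic correction, since their manual placements disagree even on scans of the same face.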

Full paper

Details about the datasets used

Manual annotations for 100 scans from FRGC Database

 

Asymmetry Patterns Shape Contexts to Describe the 3D Geometry of Craniofacial Landmarks

BEST PAPER AWARD GRAPP 2013

F.M. Sukno, J.L. Waddington and P.F. Whelan

Proc. 8th International Conference on Computer Graphics Theory and Applications, Barcelona, Spain, 2013. Computer Vision, Imaging and Computer Graphics -- Theory and Applications, Volume 458, pp 19-35, 2014.

 

We present a new family of 3D geometry descriptors based on asymmetry patterns from the popular 3D Shape Contexts (3DSC). Our approach resolves the azimuth ambiguity of 3DSC, thus providing rotational invariance, at the expense of a marginal increase in computational load, outperforming previous algorithms dealing with azimuth ambiguity.

We build on a recently presented measure of approximate rotational symmetry in 2D, defined as the overlapping area between a shape and rotated versions of itself, to extract asymmetry patterns from a 3DSC in a variety of ways, depending on the spatial relationships that need to be highlighted or disabled. Thus, we define Asymmetry Patterns Shape Contexts (APSC) from a subset of the possible spatial relations present in the spherical grid of 3DSC; hence they can be thought of as a family of descriptors that depend on the subset that is selected. The possibility of defining APSC descriptors by selecting diverse spatial patterns from a 3DSC has two important advantages: (1) choosing the appropriate spatial patterns can considerably reduce the errors obtained with 3DSC when targeting specific types of points; (2) once one APSC descriptor is built, additional ones can be built at only incremental cost. Therefore, it is possible to use a pool of APSC descriptors to maximize accuracy without a large increase in computational cost.
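The azimuth-invariance idea can be reduced to one dimension: compare a circular histogram (think of one azimuthal ring of a 3DSC shell) with rotated copies of itself; the resulting pattern does not depend on where the azimuth origin is placed. This is a toy illustration of the principle, not the full APSC construction:

```python
import numpy as np

def rotational_asymmetry(bins):
    """Asymmetry pattern of a circular histogram: for each rotation
    step k, the normalized L1 difference between the histogram and a
    copy rotated by k bins. Invariant to the choice of azimuth origin."""
    h = np.asarray(bins, float)
    total = h.sum()
    return np.array([np.abs(h - np.roll(h, k)).sum() / (2 * total)
                     for k in range(1, len(h))])
```

Rotating the input histogram permutes the bin differences but leaves each sum unchanged, so the pattern itself is rotation-invariant; a perfectly uniform ring produces an all-zero pattern.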

Full paper

Additional results

Presentation Slides

Software to compute APSC (coming soon)

 

3D Facial Landmark Localization using Combinatorial Search and Shape Regression

F.M. Sukno, J.L. Waddington and P.F. Whelan

Proc. 5th ECCV Workshop on Non-Rigid Shape Analysis and Deformable Image Alignment, Firenze, Italy, LNCS vol. 7583, pp 32-41, 2012.

 

This paper presents a method for the automatic detection of facial landmarks. The algorithm receives a set of 3D candidate points for each landmark (e.g. from a feature detector) and performs combinatorial search constrained by a deformable shape model.

A key assumption of our approach is that for some landmarks there might not be an accurate candidate in the input set. This is tackled by detecting partial subsets of landmarks and inferring those that are missing so that the probability of the deformable model is maximized. The ability of the model to work with incomplete information makes it possible to limit the number of candidates that need to be retained, substantially reducing the number of possible combinations to be tested with respect to the alternative of trying to always detect the complete set of landmarks.

We demonstrate the accuracy of the proposed method in a set of 144 facial scans acquired by means of a hand-held laser scanner in the context of clinical craniofacial dysmorphology research. Using spin images to describe the geometry and targeting 11 facial landmarks, we obtain an average error below 3 mm, which compares favorably with other state of the art approaches based on geometric descriptors.

Full paper

Presentation slides

 

 

 

Example video 1

SRILF algorithm adaptation

 

 

Comparing 3D descriptors for local search of craniofacial landmarks

F.M. Sukno, J.L. Waddington and P.F. Whelan

Proc. 8th International Symposium on Visual Computing, Rethymnon, Crete, Greece, LNCS vol. 7432, pp 92-103, 2012.

 

This paper presents a comparison of local descriptors for a set of 26 craniofacial landmarks annotated on 144 scans acquired in the context of clinical research. We focus on the accuracy of the different descriptors on a per-landmark basis when constrained to a local search. For most descriptors, we find that the curves of expected error against the search radius have a plateau that can be used to characterize their performance, both in terms of accuracy and maximum usable range for the local search.

Six histogram-based descriptors were evaluated: three describing distances and three describing orientations. No descriptor dominated over the rest, and the best accuracy per landmark was strongly distributed among 3 of the 6 algorithms evaluated. Ordering the descriptors by average error (over all landmarks) did not coincide with the ordering by most frequently selected, indicating that a comparison of descriptors based on their global behavior might be misleading when targeting facial landmarks.

Full paper

Additional tables for variable neighborhood radii

ISVC Presentation slides

 

A quantitative assessment of 3D facial key point localization fitting 2D shape models to curvature information

F.M. Sukno, T.A. Chowdhury, J.L. Waddington and P.F. Whelan

Proc. Irish Machine Vision and Image Processing Conference, Dublin, Ireland, pp 28-33, 2011.

 

This work addresses the localization of 11 prominent facial landmarks in 3D by fitting state of the art shape models to 2D data. Quantitative results are provided for 34 scans at high resolution (texture maps of 10 M-pixels) in terms of accuracy (with respect to manual measurements) and precision (repeatability on different images from the same individual).

We obtain an average accuracy of approximately 3 mm, and median repeatability of inter-landmark distances typically below 2 mm, which are values comparable to current algorithms on automatic localization of facial landmarks. We also show that, in our experiments, the replacement of texture information by curvature features produced little change in performance, which is an important finding as it suggests the applicability of the method to any type of 3D data.

Full paper

Presentation slides

 

 

 

Example video 1

IOFASM in the texture image

Example video 2

IOFASM in the curvature image

 

More...

The above are highlights of some recent publications.

See the full list of publications here