IBM Research is releasing a large dataset called Diversity in Faces (DiF) to advance facial recognition technology. The first of its kind available to the global research community, DiF provides annotations of 1 million human facial images.
Drawing on publicly available images from the YFCC-100M Creative Commons dataset, IBM has annotated the faces using 10 coding schemes from the scientific literature.
The coding schemes principally include objective measures of human faces, such as craniofacial features, as well as more subjective annotations, such as human-labeled predictions of age and gender. IBM believes that releasing these facial coding scheme annotations on a large dataset of 1 million face images will accelerate the study of diversity and coverage of data for AI facial recognition systems, and so help make those systems fairer and more accurate.
In a post on its blog, IBM said: 'We believe the DiF dataset and its 10 coding schemes offer a jumping-off point for researchers around the globe studying facial recognition technology. The 10 facial coding methods include craniofacial (e.g., head length, nose length, forehead height), facial ratios (symmetry), visual attributes (age, gender), and pose and resolution, among others. These schemes are some of the strongest identified by the scientific literature, building a solid foundation to our collective knowledge.
'Our initial analysis has shown that the DiF dataset provides a more balanced distribution and broader coverage of facial images compared to previous datasets. Furthermore, the insights obtained from the statistical analysis of the 10 initial coding schemes on the DiF dataset have furthered our own understanding of what is important for characterizing human faces and enabled us to continue important research into ways to improve facial recognition technology.'