Face Recognition Datasets

Complete Industry Guide

April 2026

by Axon Labs

Choosing a face recognition dataset is rarely about finding the largest one. It is about matching what your model actually needs (identity coverage, demographic balance, image conditions, and a license that lets you ship) to what is actually available. Over the last fifteen years the field has produced dozens of public datasets, several of which have since been withdrawn, relicensed, or quietly outgrown by what production systems require today.

This guide walks through the public face recognition datasets that ML teams still rely on in 2026, what each one is good (and bad) for, and where commercial datasets fill the gaps that the open ones leave behind.

Public datasets at a glance

| Dataset | Identities | Images | Year | License |
| --- | --- | --- | --- | --- |
| LFW | 5,749 | 13,233 | 2007 | Research |
| YTF | 1,595 | 3,425 videos | 2011 | Research |
| CASIA-WebFace | 10,575 | 494,414 | 2014 | Research |
| CelebA | 10,177 | 202,599 | 2015 | Research |
| VGGFace2 | 9,131 | 3.31M | 2017 | Non-commercial |
| AgeDB | 568 | 16,488 | 2017 | Research |
| CFP | 500 | 7,000 | 2017 | Research |
| FairFace | - | 108,501 | 2019 | CC BY 4.0 |
| RFW | ~28,000 | ~40,000 | 2019 | Research |
| Glint360K | 360,232 | 17,091,657 | 2021 | Research |
| DigiFace-1M | 110,000 | 1.22M | 2022 | Microsoft Research License |
| Selfies & Videos FR | 1,000+ | 100,000+ files | 2025 | Commercial, GDPR |
| Selfies & Paired ID Photos | 1,000+ | Selfie + ID per subject | 2025 | Commercial, GDPR |
| NIST FRVT Evaluation | 10,000,000+ | 40,000,000+ | 2025 | Commercial, GDPR |

The classic benchmarks

LFW

Released by the University of Massachusetts in 2007, LFW (Labeled Faces in the Wild) is the dataset that defined face verification as a benchmark task. It contains 13,233 images of 5,749 individuals, collected from news photos on the web. Only 1,680 people have two or more images, which limits its usefulness as a training set; almost everyone uses it as an evaluation set.
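
LFW-style verification reduces to scoring image pairs and sweeping a similarity threshold. A minimal sketch, assuming face embeddings have already been extracted by some model (the function names here are illustrative, not from any official LFW toolkit):

```python
import numpy as np

def cosine_sim(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two batches of embeddings."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def verification_accuracy(emb_a, emb_b, same_identity, n_thresholds=201):
    """Best pair-classification accuracy over a threshold sweep,
    the usual way LFW-style verification numbers are reported."""
    sims = cosine_sim(emb_a, emb_b)
    same = np.asarray(same_identity, dtype=bool)
    thresholds = np.linspace(-1.0, 1.0, n_thresholds)
    return max(float(np.mean((sims > t) == same)) for t in thresholds)
```

In the official protocol the threshold is chosen on held-out folds (10-fold cross-validation) rather than on the test pairs themselves; this sketch skips that step for brevity.
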

YTF

YTF (YouTube Faces) is the 2011 video counterpart to LFW: 3,425 videos of 1,595 people, with an average length of 181 frames. The standard protocol is 5,000 video pairs. Like LFW, it has been saturated by modern models, but it is still a common smoke test for video-based verification systems.

CASIA-WebFace

Released by the Chinese Academy of Sciences in 2014, CASIA-WebFace was the first widely available web-scale training set for face recognition: 494,414 images of 10,575 identities, scraped from IMDb. For several years it was the default training corpus for academic models, and many older open-source face recognition models (early ArcFace, SphereFace) were trained on it.

Today it is too small to train a competitive model from scratch, but it is still a clean, reasonably balanced starting point for fine-tuning experiments. The license is research-only.

CelebA

CelebA is a celebrity face attribute dataset from CUHK: 202,599 images of 10,177 identities, each annotated with 40 binary attributes (smiling, eyeglasses, blond hair, etc.) and 5 landmark points. Its main use is attribute prediction and facial editing rather than identity recognition: the per-identity image count is too low to train a strong recognition model, but it is the standard dataset for any task that involves face attributes. Non-commercial research license.

Mid-size workhorses

VGGFace2

Released by VGG (Oxford) in 2017, VGGFace2 contains 3.31 million images of 9,131 identities, an average of 362 images per person, which is high enough to learn rich identity representations. Crucially, the dataset was curated with pose, age and illumination variation in mind, which made it the de facto fine-tuning dataset of the late 2010s. Many open-source models (InsightFace, FaceNet variants) ship VGGFace2 weights.

AgeDB

16,488 images of 568 subjects with manually verified age labels spanning 1 to 101 years. It is the standard benchmark for age-invariant face recognition. It is small enough to be used as an evaluation set rather than for training; for training data that captures the same person across years at scale, see the commercially licensed Selfies & Videos Face Recognition Dataset.

CFP

CFP (Celebrities in Frontal-Profile) contains 7,000 images of 500 subjects, built specifically to test frontal-to-profile verification, one of the harder real-world cases. Each subject has 10 frontal and 4 profile images. Use it as an evaluation set when your application has strong pose variation.

Demographic and bias datasets

Almost every classic dataset is demographically skewed, typically toward white, Western, adult faces. If your model is going to be deployed across markets, you need to evaluate it on data that is balanced by design.
FairFace

FairFace contains 108,501 face images balanced across seven race groups (White, Black, East Asian, Southeast Asian, Indian, Middle Eastern, Latino), with age and gender labels. It is released under CC BY 4.0, one of the few face datasets with a genuinely permissive license. FairFace is meant for evaluation and bias auditing, not for training a production model, but it is the default tool for measuring whether your model degrades on under-represented groups.
RFW

RFW (Racial Faces in the Wild) is a targeted bias benchmark: roughly 40,000 images of about 28,000 identities split into four racial groups (Caucasian, Indian, Asian, African). Each group has its own verification protocol, so you can compute per-group accuracy and identify disparate error rates. Research license.
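In code, a per-group audit of this kind is just verification accuracy stratified by the group label of each pair. A hedged sketch (the group labels and fixed threshold below are illustrative, not part of any RFW tooling):

```python
import numpy as np

def per_group_accuracy(sims, same_identity, groups, threshold=0.5):
    """Verification accuracy per demographic group at one fixed threshold.
    sims: similarity score per pair; same_identity: ground-truth labels;
    groups: group label of each pair (e.g. from RFW's four protocols)."""
    sims = np.asarray(sims)
    same = np.asarray(same_identity, dtype=bool)
    groups = np.asarray(groups)
    return {g: float(np.mean((sims[groups == g] > threshold) == same[groups == g]))
            for g in np.unique(groups)}

def accuracy_gap(per_group: dict) -> float:
    """Spread between the best- and worst-served groups."""
    return max(per_group.values()) - min(per_group.values())
```

Reporting the gap alongside overall accuracy is what turns a benchmark run into a bias audit: a single pooled number can hide a large disparity between groups.
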

Large-scale training sets

Glint360K

Released by the InsightFace team in 2021, Glint360K is a cleaned merge of MS1M and Celeb-500K: 17,091,657 images across 360,232 identities. It is currently the largest openly distributed training set for face recognition that has not been retracted. Most open-source ArcFace and AdaFace checkpoints from 2021 onward are trained on it. Research license.
DigiFace-1M

DigiFace-1M takes a different approach: 1.22 million synthetic images of 110,000 identities, rendered by Microsoft Research using a parametric face model. Because no real person is photographed, DigiFace-1M sidesteps the consent problem entirely. Models trained purely on DigiFace are still 2–4 percentage points behind models trained on real data of comparable size, but the gap is closing, and synthetic data is now a credible privacy-safe pre-training strategy.

Where public datasets stop being enough

Public datasets are excellent for benchmarking, prototyping, and academic publication. They start to break down the moment you have to build a product. Five recurring problems:

  1. They are old: LFW, CASIA, VGGFace2 were all collected before smartphones with high-resolution front cameras became universal. They under-represent the exact images your product will see in production.
    The Selfies & Videos Face Recognition Dataset is built around modern multi-device capture and the same subjects across time.
  2. They lack ID-document pairings: No public face dataset ships selfie-vs-government-ID image pairs, which are the foundation of every KYC flow. This is the single biggest gap for fintech and identity verification teams.
    It is exactly what the Selfies & Paired ID Photos Dataset was built to fill.
  3. They are demographically narrow: FairFace and RFW help you measure bias, but they do not give you enough volume to fix it. Closing fairness gaps requires balanced training data, which has to be collected on purpose.
    Contact us for a scoped collection.
  4. They have unclear licenses, or non-commercial ones: WebFace260M, VGGFace2, CASIA, CelebA are all research-only or share-alike. MS-Celeb-1M and MegaFace are withdrawn outright. If your compliance review asks “where did this data come from and do you have consent,” public datasets rarely have a clean answer.
  5. They do not include liveness pairing: A face recognition system that does not also detect spoofing is not deployable. None of the datasets above includes matched genuine/attack pairs; for that you need a separate liveness dataset.

Commercial datasets that fill these gaps

Axon Labs builds and licenses face recognition datasets specifically to cover what the open ecosystem misses. Each one is GDPR-compliant, collected with documented consent, and licensed for commercial use.

Selfies & Videos Face Recognition Dataset

A dataset built for face recognition systems that need both image and video data of the same person across time. 1,000+ subjects, each represented by selfies, short videos and archive photos spanning several years, covering aging, lighting, hairstyle and device variation that no public dataset captures together. Collected with consent, GDPR-compliant, licensed for commercial use → Selfies & Videos Face Recognition Dataset

Selfies & Paired ID Photos Dataset

The dataset for KYC and onboarding: selfie images paired with the same person’s government ID photo. Every pair is collected with explicit consent and a signed release. Use it to train and evaluate selfie-to-document matching, the core operation in any digital onboarding flow → Selfies & Paired ID Photos

NIST FRVT Evaluation Dataset

A dataset structured around the NIST evaluation protocols, designed for teams preparing a vendor submission or running internal benchmarks against NIST-style conditions → NIST Dataset

Demographically balanced face data

Custom collections balanced by gender, age band and ethnicity for fairness-critical deployments. Useful when your model degrades on a specific group and you need targeted training data to fix it → Contact us for a scoped quote

Frequently asked questions

What is the best face recognition dataset?

There is no single best dataset: the right choice depends on your task. For training, WebFace42M and Glint360K are the strongest open options. For evaluation, IJB-C is the most operationally meaningful benchmark, and FairFace plus RFW are essential for fairness audits. LFW is still useful as a comparability anchor but is too saturated to discriminate between modern models.

Can I use research-only datasets such as VGGFace2 or CASIA-WebFace commercially?

No. Both are released under research-only licenses. Using them to train a model that you sell or deploy commercially is a license violation, and most customer security reviews will catch it.

Is there a public dataset of selfies paired with ID documents?

No. Selfie-to-document matching is the foundation of digital KYC, but no academic dataset provides it at meaningful scale. Commercial datasets like the Selfies & Paired ID Photos Dataset exist precisely to fill this gap.

Are there datasets that follow the same person over several years?

Not in the public domain: academic datasets typically capture each subject in a single session. The Selfies & Videos Face Recognition Dataset from Axon Labs is built specifically for this case, with 1,000+ subjects each represented by selfies, short videos and archival photos spanning several years.

How do I measure demographic bias in my model?

Run your model and report per-group verification accuracy. If you observe a gap of more than a few percentage points between groups, you need balanced training data, not just rebalanced loss functions.

Which public benchmark is closest to NIST FRVT?

The IJB-B and IJB-C benchmarks are the closest public approximation to FRVT conditions. For preparation that mirrors the actual FRVT protocol structure, see our NIST Dataset.
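IJB-C results are conventionally reported as TAR at a fixed FAR (for example, TAR at FAR = 1e-4) rather than plain pair accuracy. A minimal sketch of that metric, assuming you already have similarity scores for genuine and impostor pairs (the function name is illustrative):

```python
import numpy as np

def tar_at_far(genuine_sims, impostor_sims, far=1e-3):
    """True accept rate at a fixed false accept rate.
    The threshold is the score of the k-th highest impostor pair,
    so that roughly `far` of impostor pairs are accepted; TAR is
    then the fraction of genuine pairs scoring at or above it."""
    impostor = np.sort(np.asarray(impostor_sims))[::-1]  # descending
    k = max(int(np.floor(far * len(impostor))), 1)
    threshold = impostor[k - 1]
    return float(np.mean(np.asarray(genuine_sims) >= threshold))
```

Fixing the FAR is what makes the number operationally meaningful: it pins down how often the system wrongly accepts a stranger before asking how often it correctly accepts the right person.
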

Accelerate Your AI Development Today

Speed up your AI projects with our high-quality, ready-to-use datasets. Enjoy easy integration, fast deployment, and reliable biometric data collection.

© 2022 – 2026 Copyright protected