Face Recognition Datasets

Complete Industry Guide

April 2026

by Axon Labs

Choosing a face recognition dataset is rarely about finding the largest one. It is about matching what your model actually needs (identity coverage, demographic balance, image conditions, and a license that lets you ship) to what is actually available. Over the last fifteen years the field has produced dozens of public datasets, several of which have since been withdrawn, relicensed, or quietly outgrown by what production systems require today.

This guide walks through the public face recognition datasets that ML teams still rely on in 2026, what each one is good (and bad) for, and where commercial datasets fill the gaps that the open ones leave behind.

Public datasets at a glance

| Dataset | Identities | Images | Year | License |
| --- | --- | --- | --- | --- |
| LFW | 5,749 | 13,233 | 2007 | Research |
| YTF | 1,595 | 3,425 videos | 2011 | Research |
| CASIA-WebFace | 10,575 | 494,414 | 2014 | Research |
| CelebA | 10,177 | 202,599 | 2015 | Research |
| VGGFace2 | 9,131 | 3.31M | 2017 | Non-commercial |
| AgeDB | 568 | 16,488 | 2017 | Research |
| CFP | 500 | 7,000 | 2017 | Research |
| FairFace | - | 108,501 | 2019 | CC BY 4.0 |
| RFW | ~28,000 | ~40,000 | 2019 | Research |
| Glint360K | 360,232 | 17,091,657 | 2021 | Research |
| DigiFace-1M | 110,000 | 1.22M | 2022 | Microsoft Research License |
| Selfies & Videos FR | 1,000+ | 100,000+ files | 2025 | Commercial, GDPR |
| Selfies & Paired ID Photos | 1,000+ | Selfie + ID per subject | 2025 | Commercial, GDPR |
| NIST FRVT Evaluation | 10,000,000+ | 40,000,000+ | 2025 | Commercial, GDPR |

The classic benchmarks

LFW

Released by the University of Massachusetts in 2007, LFW (Labeled Faces in the Wild) is the dataset that defined face verification as a benchmark task. It contains 13,233 images of 5,749 individuals, collected from news photos on the web. Only 1,680 people have two or more images, which limits its usefulness as a training set; almost everyone uses it as an evaluation set.
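
LFW-style verification reduces to scoring image pairs and sweeping a similarity threshold. A minimal sketch, assuming face embeddings have already been extracted by some model (the function names here are illustrative, not from any official LFW toolkit):

```python
import numpy as np

def cosine_sim(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two batches of embeddings."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def verification_accuracy(emb_a, emb_b, same_identity, n_thresholds=201):
    """Best pair-classification accuracy over a threshold sweep,
    the usual way LFW-style verification numbers are reported."""
    sims = cosine_sim(emb_a, emb_b)
    same = np.asarray(same_identity, dtype=bool)
    thresholds = np.linspace(-1.0, 1.0, n_thresholds)
    return max(float(np.mean((sims > t) == same)) for t in thresholds)
```

In the official protocol the threshold is chosen on held-out folds (10-fold cross-validation) rather than on the test pairs themselves; this sketch skips that step for brevity.
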

YTF

YTF (YouTube Faces) is the 2011 video counterpart to LFW: 3,425 videos of 1,595 people, with an average length of 181 frames. The standard protocol is 5,000 video pairs. Like LFW, it has been saturated by modern models, but it is still a common smoke test for video-based verification systems.

CASIA-WebFace

Released by the Chinese Academy of Sciences in 2014, CASIA-WebFace was the first widely available web-scale training set for face recognition: 494,414 images of 10,575 identities, scraped from IMDb. For several years it was the default training corpus for academic models, and many older open-source face recognition models (early ArcFace, SphereFace) were trained on it.

Today it is too small to train a competitive model from scratch, but it is still a clean, reasonably balanced starting point for fine-tuning experiments. The license is research-only.

CelebA

CelebA is a celebrity face attribute dataset from CUHK: 202,599 images of 10,177 identities, each annotated with 40 binary attributes (smiling, eyeglasses, blond hair, etc.) and 5 landmark points. Its main use is attribute prediction and facial editing rather than identity recognition: the per-identity image count is too low to train a strong recognition model, but it is the standard dataset for any task that involves face attributes. Non-commercial research license.

Mid-size workhorses

VGGFace2

Released by VGG (Oxford) in 2017, VGGFace2 contains 3.31 million images of 9,131 identities, an average of 362 images per person, which is high enough to learn rich identity representations. Crucially, the dataset was curated with pose, age and illumination variation in mind, which made it the de facto fine-tuning dataset of the late 2010s. Many open-source models (InsightFace, FaceNet variants) ship VGGFace2 weights.

AgeDB

16,488 images of 568 subjects with manually verified age labels spanning 1 to 101 years. It is the standard benchmark for age-invariant face recognition. It is small enough to be used as an evaluation set rather than for training; for training data that captures the same person across years at scale, see the commercially licensed Selfies & Videos Face Recognition Dataset.

CFP

CFP (Celebrities in Frontal-Profile) contains 7,000 images of 500 subjects, built specifically to test frontal-to-profile verification, one of the harder real-world cases. Each subject has 10 frontal and 4 profile images. Use it as an evaluation set when your application has strong pose variation.

Demographic and bias datasets

Almost every classic dataset is demographically skewed, typically toward white, Western, adult faces. If your model is going to be deployed across markets, you need to evaluate it on data that is balanced by design.
FairFace

FairFace contains 108,501 face images balanced across seven race groups (White, Black, East Asian, Southeast Asian, Indian, Middle Eastern, Latino), with age and gender labels. It is released under CC BY 4.0, one of the few face datasets with a genuinely permissive license. FairFace is meant for evaluation and bias auditing, not for training a production model, but it is the default tool for measuring whether your model degrades on under-represented groups.
RFW

RFW (Racial Faces in the Wild) is a targeted bias benchmark: roughly 40,000 images of about 28,000 identities split into four racial groups (Caucasian, Indian, Asian, African). Each group has its own verification protocol, so you can compute per-group accuracy and identify disparate error rates. Research license.
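In code, a per-group audit of this kind is just verification accuracy stratified by the group label of each pair. A hedged sketch (the group labels and fixed threshold below are illustrative, not part of any RFW tooling):

```python
import numpy as np

def per_group_accuracy(sims, same_identity, groups, threshold=0.5):
    """Verification accuracy per demographic group at one fixed threshold.
    sims: similarity score per pair; same_identity: ground-truth labels;
    groups: group label of each pair (e.g. from RFW's four protocols)."""
    sims = np.asarray(sims)
    same = np.asarray(same_identity, dtype=bool)
    groups = np.asarray(groups)
    return {g: float(np.mean((sims[groups == g] > threshold) == same[groups == g]))
            for g in np.unique(groups)}

def accuracy_gap(per_group: dict) -> float:
    """Spread between the best- and worst-served groups."""
    return max(per_group.values()) - min(per_group.values())
```

Reporting the gap alongside overall accuracy is what turns a benchmark run into a bias audit: a single pooled number can hide a large disparity between groups.
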

Large-scale training sets

Glint360K

Released by the InsightFace team in 2021, Glint360K is a cleaned merge of MS1M and Celeb-500K: 17,091,657 images across 360,232 identities. It is currently the largest openly distributed training set for face recognition that has not been retracted. Most open-source ArcFace and AdaFace checkpoints from 2021 onward are trained on it. Research license.
DigiFace-1M

DigiFace-1M takes a different approach: 1.22 million synthetic images of 110,000 identities, rendered by Microsoft Research using a parametric face model. Because no real person is photographed, DigiFace-1M sidesteps the consent problem entirely. Models trained purely on DigiFace are still 2–4 percentage points behind models trained on real data of comparable size, but the gap is closing, and synthetic data is now a credible privacy-safe pre-training strategy.

Where public datasets stop being enough

Public datasets are excellent for benchmarking, prototyping, and academic publication. They start to break down the moment you have to build a product. Five recurring problems:

  1. They are old: LFW, CASIA, VGGFace2 were all collected before smartphones with high-resolution front cameras became universal. They under-represent the exact images your product will see in production.
    The Selfies & Videos Face Recognition Dataset is built around modern multi-device capture and the same subjects across time.
  2. They lack ID-document pairings: No public face dataset ships selfie-vs-government-ID image pairs, which are the foundation of every KYC flow. This is the single biggest gap for fintech and identity verification teams.
    It is exactly what the Selfies & Paired ID Photos Dataset was built to fill.
  3. They are demographically narrow: FairFace and RFW help you measure bias, but they do not give you enough volume to fix it. Closing fairness gaps requires balanced training data, which has to be collected on purpose.
    Contact us for a scoped collection.
  4. They have unclear licenses, or non-commercial ones: WebFace260M, VGGFace2, CASIA, CelebA are all research-only or share-alike. MS-Celeb-1M and MegaFace are withdrawn outright. If your compliance review asks “where did this data come from and do you have consent,” public datasets rarely have a clean answer.
  5. They do not include liveness pairing: A face recognition system that does not also detect spoofing is not deployable. None of the datasets above includes matched genuine/attack pairs; for that you need a separate liveness dataset.

Commercial datasets that fill these gaps

Axon Labs builds and licenses face recognition datasets specifically to cover what the open ecosystem misses. Each one is GDPR-compliant, collected with documented consent, and licensed for commercial use.

Selfies & Videos Face Recognition Dataset

A dataset built for face recognition systems that need both image and video data of the same person across time. 1,000+ subjects, each represented by selfies, short videos and archive photos spanning several years, covering aging, lighting, hairstyle and device variation that no public dataset captures together. Collected with consent, GDPR-compliant, licensed for commercial use → Selfies & Videos Face Recognition Dataset

Selfies & Paired ID Photos Dataset

The dataset for KYC and onboarding: selfie images paired with the same person’s government ID photo. Every pair is collected with explicit consent and a signed release. Use it to train and evaluate selfie-to-document matching, the core operation in any digital onboarding flow → Selfies & Paired ID Photos

NIST FRVT Evaluation Dataset

A dataset structured around the NIST evaluation protocols, designed for teams preparing a vendor submission or running internal benchmarks against NIST-style conditions → NIST Dataset

Demographically balanced face data

Custom collections balanced by gender, age band and ethnicity for fairness-critical deployments. Useful when your model degrades on a specific group and you need targeted training data to fix it → Contact us for a scoped quote

Frequently asked questions

What is the best face recognition dataset?

There is no single best dataset: the right choice depends on your task. For training, WebFace42M and Glint360K are the strongest open options. For evaluation, IJB-C is the most operationally meaningful benchmark, and FairFace plus RFW are essential for fairness audits. LFW is still useful as a comparability anchor but is too saturated to discriminate between modern models.

Can I use research-only datasets such as VGGFace2 or CASIA-WebFace commercially?

No. Both are released under research-only licenses. Using them to train a model that you sell or deploy commercially is a license violation, and most customer security reviews will catch it.

Is there a public dataset of selfies paired with ID documents?

No. Selfie-to-document matching is the foundation of digital KYC, but no academic dataset provides it at meaningful scale. Commercial datasets like the Selfies & Paired ID Photos Dataset exist precisely to fill this gap.

Are there datasets that follow the same person over several years?

Not in the public domain: academic datasets typically capture each subject in a single session. The Selfies & Videos Face Recognition Dataset from Axon Labs is built specifically for this case, with 1,000+ subjects each represented by selfies, short videos and archival photos spanning several years.

How do I measure demographic bias in my model?

Run your model and report per-group verification accuracy. If you observe a gap of more than a few percentage points between groups, you need balanced training data, not just rebalanced loss functions.

Which public benchmark is closest to NIST FRVT?

The IJB-B and IJB-C benchmarks are the closest public approximation to FRVT conditions. For preparation that mirrors the actual FRVT protocol structure, see our NIST Dataset.
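IJB-C results are conventionally reported as TAR at a fixed FAR (for example, TAR at FAR = 1e-4) rather than plain pair accuracy. A minimal sketch of that metric, assuming you already have similarity scores for genuine and impostor pairs (the function name is illustrative):

```python
import numpy as np

def tar_at_far(genuine_sims, impostor_sims, far=1e-3):
    """True accept rate at a fixed false accept rate.
    The threshold is the score of the k-th highest impostor pair,
    so that roughly `far` of impostor pairs are accepted; TAR is
    then the fraction of genuine pairs scoring at or above it."""
    impostor = np.sort(np.asarray(impostor_sims))[::-1]  # descending
    k = max(int(np.floor(far * len(impostor))), 1)
    threshold = impostor[k - 1]
    return float(np.mean(np.asarray(genuine_sims) >= threshold))
```

Fixing the FAR is what makes the number operationally meaningful: it pins down how often the system wrongly accepts a stranger before asking how often it correctly accepts the right person.
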

Accelerate Your AI Development Today

Speed up your AI projects with our high-quality, ready-to-use datasets. Enjoy easy integration, fast deployment, and reliable biometric data collection.

© 2022 – 2026 Copyright protected