NIST Face Recognition Dataset

Each individual has 3–5 facial images. The dataset covers more than 10 million individuals across multiple ethnicities.

Check samples on Kaggle

Dataset Summary

Parameter	Value
Volume	40,000,000+ facial images from 10,000,000+ unique identities
Coverage	Face recognition data prepared for NIST FRVT 1:1 verification and 1:N identification benchmarking
Demographics	Multi-ethnic coverage including Caucasian, Black, East Asian, South Asian, Middle Eastern, and other populations
Devices	Compiled from multiple verified biometric devices
Conditions	Deduplicated and quality-checked through a multi-stage cleaning pipeline

Introduction

The NIST Face Recognition Dataset is a large-scale, deduplicated face dataset built for training and benchmarking face recognition models targeting submission to the NIST Face Recognition Vendor Test (FRVT) – the global benchmark for biometric face recognition algorithm performance.

The dataset contains 40,000,000+ facial images across 10,000,000+ unique identities, with 3–5 verified images per identity, sourced from multiple biometric repositories and processed through a multi-stage cleaning pipeline. It supports both FRVT 1:1 verification and FRVT 1:N identification benchmarks.

Dataset Features

Per-identity structure: Each of the 10,000,000+ unique identities is represented by 3–5 verified facial images, providing the within-identity variation needed for face recognition models to learn identity-invariant features rather than overfitting to a single image per person
Identity uniqueness guarantee: Every identity has been deduplicated both internally (across its own images) and against every other identity in the dataset through a multi-stage cleaning pipeline. No identity appears under multiple IDs, and no near-duplicate images appear within a single identity’s image set
Multi-ethnic coverage by design: Ethnicity distribution is balanced across Caucasian, Black, East Asian, South Asian, Middle Eastern, and other populations to support training of demographically robust face recognition systems and to mitigate the bias effects documented in NIST FRVT demographic studies
Image quality standards: All images meet minimum face size, resolution, and visibility thresholds. Faces with extreme occlusion, motion blur, or insufficient pixel coverage are excluded during quality verification
Cross-source consistency: Images sourced from multiple repositories are normalized to a consistent format and quality baseline, eliminating cross-source artifacts that would otherwise introduce noise into model training
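
The per-identity structure described above can be spot-checked on delivery. The following is a minimal sketch, assuming a hypothetical manifest of (identity ID, image path) pairs; the actual delivery format is not specified here and may differ.

```python
from collections import Counter


def images_per_identity(manifest):
    """Count images per identity from (identity_id, image_path) pairs."""
    return Counter(identity_id for identity_id, _ in manifest)


def out_of_range_identities(manifest, min_images=3, max_images=5):
    """Return {identity_id: count} for identities outside the 3-5 image range."""
    counts = images_per_identity(manifest)
    return {
        identity_id: n
        for identity_id, n in counts.items()
        if not (min_images <= n <= max_images)
    }


# Hypothetical manifest rows, for illustration only.
manifest = [
    ("id_0001", "id_0001/a.jpg"),
    ("id_0001", "id_0001/b.jpg"),
    ("id_0001", "id_0001/c.jpg"),
    ("id_0002", "id_0002/a.jpg"),
    ("id_0002", "id_0002/b.jpg"),
]

print(out_of_range_identities(manifest))  # {'id_0002': 2}
```

An empty result would indicate that every identity in the checked manifest has between 3 and 5 images.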

Source and collection methodology

The NIST Face Recognition Dataset is compiled and cleaned from multiple verified biometric sources rather than collected through in-house capture sessions. Source selection and cleaning follow a structured multi-stage pipeline designed to meet the data integrity standards required for NIST FRVT preparation

Source selection

Multiple verified biometric repositories with documented provenance
Sources screened for legal basis and licensing suitability

Data normalization

Cross-source format harmonization (image format, color space, resolution baseline)
Identity label reconciliation across sources
Removal of low-quality and out-of-spec images

Quality assurance and deduplication

Stage 1 – Face size and quality check: Every image is screened against minimum face size, sharpness, and visibility thresholds. Images failing these criteria are excluded before further processing
Stage 2 – Within-identity deduplication: All images grouped under a single identity are checked against each other to remove near-duplicates. The pipeline retains 3–5 unique, non-redundant photos per individual
Stage 3 – Identity validation: Each identity is validated to ensure all photos under one ID belong to the same person. Mismatched groupings are flagged and corrected
Stage 4 – Cross-dataset deduplication: Identities and images are compared across the entire dataset to ensure no identity appears under multiple IDs and no image appears under multiple identities
Stage 5 – Final integrity check: A final pass removes any remaining non-conforming data, format inconsistencies, or quality outliers. Only verified, clean records enter the final dataset

Use cases and applications

NIST FRVT 1:1 Verification Benchmarking: Train and validate face verification models against the scale and diversity required for competitive performance on NIST FRVT 1:1 verification – the global standard for face matching algorithm evaluation
NIST FRVT 1:N Identification Benchmarking: Build face identification models capable of operating against large galleries, the scenario tested by NIST FRVT 1:N identification. With 10,000,000+ identities, the dataset provides the gallery scale needed for realistic 1:N evaluation
Face Recognition Model Training at Scale: Train commercial-grade face recognition models with the volume of identity coverage required for production deployment. The 40,000,000+ images across 10,000,000+ identities provide diversity that smaller academic datasets cannot
Demographic Bias Evaluation: Audit and improve face recognition models for demographic robustness using a dataset specifically designed with multi-ethnic balance
Pre-Deployment Validation: Stress-test face recognition systems before deployment in production verification, identification, or authentication applications, where false reject rates and demographic disparities directly translate to compliance and revenue risks
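
As a rough illustration of the 1:1 and 1:N use cases above, the sketch below computes a false match rate (FMR) and false non-match rate (FNMR) at a fixed threshold, plus a simple rank-1 hit rate for identification. The scores and function names are hypothetical, and rank-1 accuracy is a simplification: FRVT 1:N reports use their own metrics and protocol.

```python
def fmr_fnmr(genuine_scores, impostor_scores, threshold):
    """1:1 verification error rates at a fixed similarity threshold.

    FNMR: share of same-person comparisons scoring below the threshold.
    FMR:  share of different-person comparisons scoring at or above it.
    """
    fnmr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    fmr = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fmr, fnmr


def rank1_hit_rate(searches):
    """1:N identification: share of searches whose top-scoring gallery
    identity is the true identity.

    searches: list of (true_id, {gallery_id: similarity_score}).
    """
    hits = 0
    for true_id, scores in searches:
        best_id = max(scores, key=scores.get)
        if best_id == true_id:
            hits += 1
    return hits / len(searches)


# Hypothetical similarity scores, for illustration only.
genuine = [0.91, 0.84, 0.77, 0.58]
impostor = [0.12, 0.35, 0.61, 0.20, 0.05]
print(fmr_fnmr(genuine, impostor, threshold=0.6))  # (0.2, 0.25)

searches = [
    ("p1", {"p1": 0.9, "p2": 0.4}),
    ("p2", {"p1": 0.7, "p2": 0.6}),  # miss: p1 outranks the true identity
    ("p3", {"p1": 0.2, "p3": 0.8}),
    ("p4", {"p4": 0.95, "p2": 0.1}),
]
print(rank1_hit_rate(searches))  # 0.75
```

Raising the threshold lowers FMR at the cost of a higher FNMR, so the two rates are usually reported together at a stated operating point.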

Demographic Coverage and Bias Mitigation

NIST published Internal Report 8280, the most comprehensive study to date on demographic effects in face recognition. The report documented that face recognition algorithms exhibit measurable performance differences across demographic groups: false match and false non-match rates can vary by orders of magnitude depending on ethnicity, age, and gender.

For organizations deploying face recognition in regulated contexts such as banking, law enforcement, and government identity systems, these demographic effects are not academic concerns. They translate directly to compliance risk, regulatory exposure, and reputational damage.

The NIST Face Recognition Dataset is built with explicit demographic balance to support training of more equitable face recognition models. Coverage spans:

Caucasian populations
Black populations
East Asian populations
South Asian populations
Middle Eastern populations
Other populations including Latin American and mixed ethnicity groups

Per-ethnicity image counts and identity counts are documented in the dataset’s accompanying datasheet, available upon request. Models trained on this dataset can be evaluated for demographic robustness
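
One way to evaluate demographic robustness is to compare false match rates across groups at a single threshold. The sketch below is a minimal example under stated assumptions: the group labels, scores, and function names are hypothetical, and the max-to-min ratio is just one simple summary of disparity.

```python
from collections import defaultdict


def fmr_by_group(impostor_trials, threshold):
    """False match rate per demographic group.

    impostor_trials: iterable of (group_label, similarity_score) for
    comparisons between different people within that group.
    """
    totals = defaultdict(int)
    false_matches = defaultdict(int)
    for group, score in impostor_trials:
        totals[group] += 1
        if score >= threshold:
            false_matches[group] += 1
    return {group: false_matches[group] / totals[group] for group in totals}


def fmr_disparity(rates):
    """Ratio of the highest to the lowest nonzero group FMR.
    Returns None when fewer than two groups have a nonzero rate."""
    nonzero = [r for r in rates.values() if r > 0]
    if len(nonzero) < 2:
        return None
    return max(nonzero) / min(nonzero)


# Hypothetical impostor scores, for illustration only.
trials = [
    ("group_a", 0.10), ("group_a", 0.72), ("group_a", 0.30), ("group_a", 0.20),
    ("group_b", 0.81), ("group_b", 0.75), ("group_b", 0.15), ("group_b", 0.40),
]

rates = fmr_by_group(trials, threshold=0.7)
print(rates)                 # {'group_a': 0.25, 'group_b': 0.5}
print(fmr_disparity(rates))  # 2.0
```

A ratio near 1 suggests similar false match behavior across groups at that threshold; meaningful estimates need far more impostor comparisons per group than this toy example uses.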

Sample dataset

A sample version of this dataset is available on Kaggle. Leave a request for additional samples in the form below

Related Datasets

Didn’t find what you were looking for? Use existing related datasets:

Selfies and Paired ID Photos (for KYC face matching)
Selfies & Videos Face Recognition Dataset (for video face recognition)

How is this dataset different from public face datasets like LFW, VGGFace2, or MegaFace?

Public academic face datasets have significant limitations for commercial use. LFW contains only ~13,000 images of ~5,700 people, making it roughly three orders of magnitude smaller than this dataset. VGGFace2 and MegaFace are research-only and cannot be used for commercial face recognition products. The NIST Face Recognition Dataset is built specifically for commercial use, at the scale (40,000,000+ images) and demographic diversity required for production face recognition models.

Is this dataset suitable for preparing a NIST FRVT submission?

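Yes. The dataset is built with FRVT preparation as its primary use case. The 10,000,000+ unique identity scale matches the gallery sizes used in FRVT 1:N identification testing, and the multi-ethnic demographic balance supports training models that are robust to the demographic effects documented in NIST FRVT demographic studies.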

What demographic balance does the dataset provide?

Coverage spans Caucasian, Black, East Asian, South Asian, Middle Eastern, and other populations including Latin American and mixed ethnicity groups. Per-ethnicity distribution counts are documented in the accompanying datasheet, available on request

How was deduplication performed?

Through a five-stage cleaning pipeline: face quality check, within-identity deduplication, identity validation, cross-dataset deduplication, and final integrity check. The pipeline ensures no identity appears under multiple IDs and no image appears under multiple identities

Can I get a sample before purchasing?

Yes. We provide free representative samples covering the full demographic range of the main dataset. Submit a request through the form on this page and we will deliver a sample within 1–2 business days

Contact us

Tell us about yourself, and get access to free samples of the dataset
