NIST Face Recognition Dataset

Each individual has 3–5 different photos. The dataset contains more than 10 million individuals across multiple ethnicities.

Check samples on Kaggle

Dataset Summary

Parameter: Value
Volume: 40,000,000+ images from 10,000,000+ unique identities
Coverage: Face recognition data prepared for NIST FRVT 1:1 verification and 1:N identification benchmarking
Demographics: Multi-ethnic coverage including Caucasian, Black, East Asian, South Asian, Middle Eastern, and other populations
Devices: Compiled from multiple verified biometric devices
Conditions: Deduplicated and quality-checked through a multi-stage cleaning pipeline

Introduction

The NIST Face Recognition Dataset is a large-scale, deduplicated face dataset built for training and benchmarking face recognition models targeting submission to the NIST Face Recognition Vendor Test (FRVT) – the global benchmark for biometric face recognition algorithm performance.

The dataset contains 40,000,000+ face images across 10,000,000+ unique identities, with 3–5 verified images per identity, sourced from multiple biometric repositories and processed through a multi-stage cleaning pipeline. It is intended to support preparation for both the FRVT 1:1 verification and FRVT 1:N identification benchmarks.

Dataset Features

  • Per-identity structure: Each of the 10,000,000+ unique identities is represented by 3–5 verified images, providing the within-identity variation needed for face recognition models to learn identity-invariant features rather than overfitting to a single image per person
  • Identity uniqueness guarantee: Every identity in the dataset has been deduplicated against itself and against the rest of the dataset through a multi-stage cleaning pipeline. No identity appears under multiple IDs, and no near-duplicate images appear within a single identity’s image set
  • Multi-ethnic coverage by design: Ethnicity distribution is balanced across Caucasian, Black, East Asian, South Asian, Middle Eastern, and other populations to support training of demographically robust face recognition systems and to mitigate the bias effects documented in NIST FRVT demographic studies
  • Image quality standards: All images meet minimum face size, resolution, and visibility thresholds. Faces with extreme occlusion, motion blur, or insufficient pixel coverage are excluded during quality verification
  • Cross-source consistency: Images sourced from multiple repositories are normalized to a consistent format and quality baseline, eliminating cross-source artifacts that would otherwise introduce noise into model training
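
The per-identity structure and uniqueness properties listed above can be checked mechanically once a manifest is in hand. The following is a minimal sketch, assuming a hypothetical manifest that maps identity IDs to lists of image filenames; the dataset's actual delivery format is not described here, so the names `validate_manifest`, `id_0001`, and the `.jpg` filenames are illustrative only.

```python
from collections import Counter

MIN_IMAGES, MAX_IMAGES = 3, 5  # per-identity range stated in the dataset description


def validate_manifest(manifest: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return problems found in a hypothetical {identity_id: [image, ...]} manifest."""
    problems: dict[str, list[str]] = {"bad_count": [], "shared_image": []}

    # Each identity should carry between 3 and 5 images.
    for identity_id, images in manifest.items():
        if not MIN_IMAGES <= len(images) <= MAX_IMAGES:
            problems["bad_count"].append(identity_id)

    # No image should be listed under more than one identity.
    owners = Counter(img for images in manifest.values() for img in set(images))
    problems["shared_image"] = sorted(img for img, n in owners.items() if n > 1)
    return problems


# Toy manifest (hypothetical data) with two deliberate defects.
manifest = {
    "id_0001": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "id_0002": ["b1.jpg", "b2.jpg"],                      # too few images
    "id_0003": ["c1.jpg", "c2.jpg", "c3.jpg", "a1.jpg"],  # reuses a1.jpg
}
report = validate_manifest(manifest)
print(report)
```

A check like this only confirms the bookkeeping (counts and filename uniqueness); it says nothing about whether two different files show the same face, which is what the deduplication stages address.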

Source and collection methodology

The NIST Face Recognition Dataset is compiled and cleaned from multiple verified biometric sources rather than collected through in-house capture sessions. Source selection and cleaning follow a structured multi-stage pipeline designed to meet the data integrity standards required for NIST FRVT preparation. Data processing complies with GDPR Article 9 for the processing of biometric data

Source selection

  • Multiple verified biometric repositories with documented provenance
  • Sources screened for legal basis and licensing suitability

Data normalization

  • Cross-source format harmonization (image format, color space, resolution baseline)
  • Identity label reconciliation across sources
  • Removal of low-quality and out-of-spec images
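
Identity label reconciliation across sources can be pictured as merging source-specific labels into one canonical ID per person. The sketch below uses a union-find structure for that merge; it is an illustration, not the vendor's documented method. The source names, labels, the `ID00000001` format, and the list of "same person" pairs are all hypothetical, and how two records are judged to be the same person is outside the sketch.

```python
def reconcile_labels(records, same_person_pairs):
    """Merge source-specific labels that refer to the same person.

    records: list of (source, source_label) tuples.
    same_person_pairs: pairs of records judged to be the same person.
    Returns {record: canonical_id}.
    """
    parent = {r: r for r in records}

    def find(r):
        # Follow parent links to the root, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in same_person_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller root so the result is deterministic.
            lo, hi = sorted((root_a, root_b))
            parent[hi] = lo

    # Assign one canonical ID per connected group of records.
    roots = sorted({find(r) for r in records})
    canonical = {root: f"ID{i:08d}" for i, root in enumerate(roots, start=1)}
    return {r: canonical[find(r)] for r in records}


records = [("src_a", "p17"), ("src_b", "x903"), ("src_c", "k5"), ("src_a", "p18")]
pairs = [
    (("src_a", "p17"), ("src_b", "x903")),
    (("src_b", "x903"), ("src_c", "k5")),
]
mapping = reconcile_labels(records, pairs)
for record, canonical_id in mapping.items():
    print(record, "->", canonical_id)
```

Union-find handles transitive matches: if source A matches B and B matches C, all three end up under one ID even though A and C were never compared directly.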

Compliance and documentation

  • Full provenance log for every included identity
  • Documentation available upon request for compliance review

Quality assurance and deduplication

  • Stage 1 – Face size and quality check: Every image is screened against minimum face size, sharpness, and visibility thresholds. Images failing these criteria are excluded before further processing
  • Stage 2 – Within-identity deduplication: All images grouped under a single identity are checked against each other to remove near-duplicates. The pipeline retains 3–5 unique, non-redundant photos per individual
  • Stage 3 – Identity validation: Each identity is validated to ensure all photos under one ID belong to the same person. Mismatched groupings are flagged and corrected
  • Stage 4 – Cross-dataset deduplication: Identities and images are compared across the entire dataset to ensure no identity appears under multiple IDs and no image appears under multiple identities
  • Stage 5 – Final integrity check: A final pass removes any remaining non-conforming data, format inconsistencies, or quality outliers. Only verified, clean records enter the final dataset
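
To make Stage 2 concrete, here is a minimal sketch of within-identity near-duplicate removal. It assumes each image has already been reduced to an embedding vector by some face model, and it uses cosine similarity with a hypothetical threshold of 0.98; the actual similarity measure and threshold used in the pipeline are not stated in this description.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def drop_near_duplicates(images, threshold=0.98):
    """Greedy filter: keep an image only if it is not too similar to any kept one.

    images: list of (name, embedding) pairs for ONE identity.
    threshold: hypothetical similarity at or above which two images count
        as near-duplicates.
    """
    kept = []
    for name, emb in images:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((name, emb))
    return [name for name, _ in kept]


# Toy 3-dimensional embeddings (real face embeddings have hundreds of dimensions).
images = [
    ("img_1.jpg", [1.0, 0.0, 0.0]),
    ("img_2.jpg", [0.999, 0.01, 0.0]),  # nearly identical to img_1
    ("img_3.jpg", [0.6, 0.8, 0.0]),
    ("img_4.jpg", [0.5, 0.5, 0.7]),
]
unique_names = drop_near_duplicates(images)
print(unique_names)
```

The greedy pass keeps the first image of each near-duplicate cluster, so the result depends on input order. A production pipeline would likely also choose which copy to keep by quality score.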

Use cases and applications

  • NIST FRVT 1:1 Verification Benchmarking: Train and validate face verification models against the scale and diversity required for competitive performance on NIST FRVT 1:1 verification – the global standard for face matching algorithm evaluation
  • NIST FRVT 1:N Identification Benchmarking: Build face identification models capable of operating against large galleries, exactly the scenario tested by NIST FRVT 1:N identification, where 10,000,000+ identities provide the gallery scale needed for realistic 1:N evaluation
  • Face Recognition Model Training at Scale: Train commercial-grade face recognition models with the volume of identity coverage required for production deployment. With 40,000,000+ images across 10,000,000+ identities, the dataset provides diversity that smaller academic datasets cannot
  • Demographic Bias Evaluation: Audit and improve face recognition models for demographic robustness using a dataset specifically designed with multi-ethnic balance
  • Pre-Deployment Validation: Stress-test face recognition systems before deployment in production verification, identification, or authentication applications, where false reject rates and demographic disparities directly translate to compliance and revenue risks
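
For the 1:1 verification use case, results are usually summarized as a false match rate (FMR) and a false non-match rate (FNMR) at a chosen decision threshold. The sketch below computes both from lists of similarity scores. The scores and the 0.6 threshold are made-up illustration values, not results from this dataset or from FRVT.

```python
def verification_error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FMR, FNMR) at a similarity threshold.

    A comparison is accepted as a match when score >= threshold.
    FMR  = fraction of impostor pairs (different people) wrongly accepted.
    FNMR = fraction of genuine pairs (same person) wrongly rejected.
    """
    false_matches = sum(s >= threshold for s in impostor_scores)
    false_non_matches = sum(s < threshold for s in genuine_scores)
    return (false_matches / len(impostor_scores),
            false_non_matches / len(genuine_scores))


# Hypothetical similarity scores from some face matcher.
genuine = [0.91, 0.84, 0.77, 0.58, 0.95]
impostor = [0.12, 0.33, 0.61, 0.08, 0.25, 0.41, 0.19, 0.05, 0.30, 0.22]

fmr, fnmr = verification_error_rates(genuine, impostor, threshold=0.6)
print(f"FMR={fmr:.2f} FNMR={fnmr:.2f}")
```

Raising the threshold lowers FMR and raises FNMR, so the two are normally reported together, with FNMR quoted at a fixed target FMR.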

Demographic Coverage and Bias Mitigation

NIST Internal Report 8280 (2019) is among the most comprehensive studies to date on demographic effects in face recognition. The report documented that face recognition algorithms exhibit measurable performance differences across demographic groups: error rates, particularly false match rates, can vary by orders of magnitude depending on ethnicity, age, and gender.

For organizations deploying face recognition in regulated contexts such as banking, law enforcement, and government identity systems, these demographic effects are not academic concerns. They translate directly to compliance risk, regulatory exposure, and reputational damage.

The NIST Face Recognition Dataset is built with explicit demographic balance to support training of more equitable face recognition models. Coverage spans:

  • Caucasian populations
  • Black populations
  • East Asian populations
  • South Asian populations
  • Middle Eastern populations
  • Other populations including Latin American and mixed ethnicity groups

Per-ethnicity image counts and identity counts are documented in the dataset’s accompanying datasheet, available upon request. Models trained on this dataset can be evaluated for demographic robustness
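
One simple way to evaluate demographic robustness is to compute the false match rate separately for each group at a shared threshold and compare the results. The following sketch assumes impostor comparisons have already been labeled with a group; the labels `group_a` and `group_b`, the scores, and the threshold are hypothetical and do not reflect the datasheet or any measured result.

```python
from collections import defaultdict


def fmr_by_group(impostor_comparisons, threshold):
    """Per-group false match rate.

    impostor_comparisons: iterable of (group_label, similarity_score) for
        comparisons between DIFFERENT people within the same group.
    """
    totals = defaultdict(int)
    false_matches = defaultdict(int)
    for group, score in impostor_comparisons:
        totals[group] += 1
        if score >= threshold:
            false_matches[group] += 1
    return {g: false_matches[g] / totals[g] for g in totals}


# Hypothetical impostor scores, four comparisons per group.
comparisons = [
    ("group_a", 0.20), ("group_a", 0.65), ("group_a", 0.10), ("group_a", 0.35),
    ("group_b", 0.70), ("group_b", 0.62), ("group_b", 0.15), ("group_b", 0.40),
]
rates = fmr_by_group(comparisons, threshold=0.6)

# Ratio of worst to best group; only defined when the best rate is non-zero.
worst, best = max(rates.values()), min(rates.values())
disparity = worst / best if best > 0 else float("inf")
print(rates, disparity)
```

With realistic false match rates (for example, one in a million) each group needs a very large number of impostor comparisons before the per-group estimate is stable, which is one reason per-group identity counts matter.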

Legal & Compliance

We prioritize data privacy, ethical AI development, and regulatory compliance. The NIST Face Recognition Dataset is collected and processed in accordance with global data protection standards including GDPR, ensuring legality, security, and responsible AI practices.

Sample dataset

A sample version of this dataset is available on Kaggle. Leave a request for additional samples in the form below

Related Datasets

Public academic face datasets have significant limitations for commercial use. LFW contains only ~13,000 images of ~5,700 people, roughly three orders of magnitude fewer images than this dataset. VGGFace2 and MegaFace were released for research use and are not licensed for commercial face recognition products. The NIST Face Recognition Dataset is built specifically for commercial use, at the scale (40,000,000+ images) and demographic diversity required for production face recognition models.

The dataset is built with FRVT preparation as its primary use case. The 10,000,000+ unique identity scale is intended to match the gallery sizes used in FRVT 1:N identification testing, and the multi-ethnic demographic balance is intended to mitigate the bias effects documented in NIST FRVT demographic studies.
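
As an illustration of what 1:N identification involves, the sketch below searches a probe embedding against a gallery of enrolled templates and returns the best match only if it clears a threshold. It is a toy example under stated assumptions: the three-entry gallery, the 3-dimensional embeddings, and the 0.8 threshold are hypothetical, and a linear scan like this would not be practical at a gallery size of 10,000,000+ identities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identify(probe, gallery, threshold=0.8):
    """Return (identity_id, score) for the best gallery match.

    Returns (None, score) when the best score is below the threshold, i.e.
    the probe is treated as not enrolled.
    """
    best_id, best_score = None, -1.0
    for identity_id, template in gallery.items():
        score = cosine(probe, template)
        if score > best_score:
            best_id, best_score = identity_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Hypothetical gallery: one template per enrolled identity.
gallery = {
    "id_0001": [1.0, 0.0, 0.0],
    "id_0002": [0.0, 1.0, 0.0],
    "id_0003": [0.0, 0.0, 1.0],
}

enrolled_hit = identify([0.1, 0.9, 0.1], gallery)    # close to id_0002
unenrolled_hit = identify([0.6, 0.6, 0.6], gallery)  # equally far from all
print(enrolled_hit[0], unenrolled_hit[0])
```

The threshold trades off the two 1:N error types: returning a wrong identity for someone who is not enrolled, and failing to return the right identity for someone who is.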

Coverage spans Caucasian, Black, East Asian, South Asian, Middle Eastern, and other populations including Latin American and mixed ethnicity groups. Per-ethnicity distribution counts are documented in the accompanying datasheet, available on request

Deduplication and cleaning are performed through a five-stage pipeline: face size and quality check, within-identity deduplication, identity validation, cross-dataset deduplication, and final integrity check. The pipeline is designed to ensure that no identity appears under multiple IDs and no image appears under multiple identities.

All data is processed in compliance with GDPR Article 9, the article governing biometric data processing. Source datasets were selected with consideration for legal basis and consent requirements. Comprehensive compliance documentation is available upon request.

Free representative samples covering the full demographic range of the main dataset are available. Submit a request through the form on this page and we will deliver a sample within 1–2 business days.

Contact us

Tell us about yourself, and get access to free samples of the dataset 