Liveness Detection Datasets

Practical Guide to Face Anti-Spoofing Data

April 2026

by Axon Labs

A face recognition system that does not also detect spoofing is not deployable. Once your verification model is good enough that it stops failing on real users, the next thing that breaks is an attacker holding up a printed photo, a phone screen, a silicone mask, or a deepfake video. Stopping these is the job of presentation attack detection (PAD), also called face anti-spoofing or liveness detection.

This guide walks through the public PAD datasets that face anti-spoofing teams still rely on in 2026, what each one is good and bad for, and where commercial datasets close the gaps that academic ones leave behind. Every entry uses the same evaluation framework so you can compare them at a glance.

If you are building face recognition itself, see our companion guide to face recognition datasets.

Datasets at a glance

| Dataset | Year | Subjects | Volume | Attack types | License |
|---|---|---|---|---|---|
| CASIA-FASD | 2012 | 50 | 600 videos | Print, cut-photo, replay | Research |
| Replay-Attack | 2012 | 50 | 1,300 videos | Print, mobile replay, HD replay | Research |
| 3DMAD | 2013 | 17 | 76,500 frames | 3D paper-craft masks | Research |
| MSU-MFSD | 2014 | 35 | 280 videos | Print, replay (phone, tablet) | Research |
| Replay-Mobile | 2016 | 40 | 1,190 videos | Print, replay, mobile capture | Research |
| HKBU-MARs V2 | 2016 | 12 | 1,008 videos | 6 types of 3D masks, 7 cameras | Research |
| OULU-NPU | 2017 | 55 | 4,950 videos | Print, replay; 6 mobile devices | Research |
| SiW (Spoof in the Wild) | 2018 | 165 | 4,478 videos | Print, replay | Research |
| ROSE-Youtu | 2018 | 20 | 3,350 videos | Print, replay, 3D mask | Research |
| SiW-M | 2019 | 493 | 1,630 videos | 13 attack types incl. masks, makeup, partial | Research |
| CASIA-SURF | 2019 | 1,000 | 21,000 videos | Print, cut-photo | Research |
| CelebA-Spoof | 2020 | 10,177 | 625,537 images | 10 spoof types, 43 attributes | Research |
| FaceForensics++ | 2019 | — | 1,000 originals + 4,000 manipulated | 4 deepfake methods | Research |
| Celeb-DF v2 | 2020 | 59 | 5,639 deepfake + 590 real | Deepfakes | Research |
| Silicone Mask Attack | 2024 | — | 10,000+ videos | 18 silicone masks | Commercial, GDPR |
| Photo Print Attacks | 2025 | 3,000+ | 10,000+ videos | Print (matte, glossy, cut-photo, warped) | Commercial, GDPR |
| Replay Display Attacks | 2025 | 6,500+ | 9,000+ videos | Screen replay (phone, tablet, monitor) | Commercial, GDPR |

A note up front: none of the public datasets above are licensed for commercial use. Every single one is research-only. If you are building a product that has to pass an iBeta certification, a NIST FRVT submission, or a customer security review, you will eventually need a commercially licensed source; see the Axon Labs dataset collection.

Classic 2D PAD benchmarks

These are the datasets the field was built on. They are small and they only cover 2D attacks (print and replay), but every face anti-spoofing paper still references them as a comparability anchor.

CASIA-FASD

Released by the Chinese Academy of Sciences in 2012. 600 videos from 50 subjects, with three attack types: warped printed photo, cut-photo (with eye holes), and video replay on a tablet. Captured at three image qualities (low, normal, high) using different cameras.

CASIA-FASD is the original face anti-spoofing benchmark. It is small enough to be used as an evaluation set and is often paired with Replay-Attack for cross-database evaluation in PAD papers. It has no 3D mask attacks at all.

Replay-Attack

Released by Idiap Research Institute the same year as CASIA-FASD. 1,300 videos from 50 subjects: 200 real-access videos plus 1,000 attack videos and 100 enrollment videos. Attacks include high-resolution print, mobile-phone replay and tablet replay, captured under two lighting conditions.

Replay-Attack and CASIA-FASD together formed the original cross-database benchmark for PAD generalisation. Both are saturated by modern models on the in-database protocol but still useful for cross-database experiments.
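
The cross-database protocol is simple to state: fix an operating threshold on the source dataset (typically at its equal error rate), then score the target dataset at that fixed threshold and report the half total error rate (HTER). The sketch below is a minimal, assumption-laden version with synthetic scores; the function names and score distributions are ours, not part of any official protocol code.

```python
import numpy as np

def eer_threshold(scores, labels):
    """Threshold where false-accept and false-reject rates cross.

    scores: higher = more likely bonafide; labels: 1 = bonafide, 0 = attack.
    """
    best_t, best_gap = None, float("inf")
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # attacks accepted as live
        frr = np.mean(scores[labels == 1] < t)   # bonafide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t

def hter(scores, labels, threshold):
    """Half Total Error Rate: mean of FAR and FRR at a fixed threshold."""
    far = np.mean(scores[labels == 0] >= threshold)
    frr = np.mean(scores[labels == 1] < threshold)
    return (far + frr) / 2

# Cross-database evaluation: threshold fixed on the source dataset
# (e.g. CASIA-FASD), HTER reported on the target (e.g. Replay-Attack).
rng = np.random.default_rng(0)
src_scores = np.concatenate([rng.normal(0.8, 0.1, 200), rng.normal(0.2, 0.1, 200)])
src_labels = np.concatenate([np.ones(200), np.zeros(200)]).astype(int)
t = eer_threshold(src_scores, src_labels)

tgt_scores = np.concatenate([rng.normal(0.7, 0.2, 100), rng.normal(0.35, 0.2, 100)])
tgt_labels = np.concatenate([np.ones(100), np.zeros(100)]).astype(int)
print(f"cross-database HTER: {hter(tgt_scores, tgt_labels, t):.3f}")
```

The point of the protocol is that the threshold is never re-tuned on the target data: a model that only looks good when re-calibrated per dataset has not generalised.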

MSU-MFSD

Michigan State University's 2014 release. 280 videos from 35 subjects, with bonafide footage captured on a laptop webcam and an Android phone. Attacks are high-resolution printed photos plus video replay on a tablet and a phone screen.

Small by modern standards, but MSU-MFSD completes the classic cross-database set alongside CASIA-FASD and Replay-Attack, and it still appears in cross-database PAD protocols. License is research-only.

Replay-Mobile

Idiap's mobile-era follow-up to Replay-Attack. 1,190 videos from 40 subjects captured on mobile devices (smartphone and tablet) under five lighting conditions. Attacks are matte and high-quality print plus screen replay. It was the first PAD benchmark explicitly designed around mobile capture, which by the late 2010s had become the dominant deployment scenario.

Mobile-era and large-scale 2D benchmarks

OULU-NPU

A 2017 release from the University of Oulu and Northwestern Polytechnical University. 4,950 videos from 55 subjects, captured on six different mobile devices across three sessions with varying illumination and background. Attack types are print and replay, generated with two printers and two display devices.

OULU-NPU is the canonical mobile-era PAD benchmark. Its four evaluation protocols specifically test cross-device, cross-environment and cross-attack-type generalisation, which is what production mobile liveness systems actually have to do. If you are evaluating a mobile PAD model and you only run one public benchmark, run OULU-NPU.

SiW (Spoof in the Wild)

Michigan State's larger PAD release. 4,478 videos from 165 subjects, captured at 1080p with significant variation in pose, expression, illumination and distance. Three protocols cover cross-pose-and-expression generalisation, cross-medium attacks and cross-protocol setups. SiW set the standard for “in the wild” 2D PAD before 3D and multi-modal datasets took over.

SiW-M

The most attack-diverse public PAD dataset. 1,630 videos from 493 subjects, with 13 distinct attack types: print, replay, half mask, silicone mask, transparent mask, paper mask, mannequin head, obfuscation makeup, impersonation makeup, cosmetic makeup, partial paper, partial funny-eye and partial paper-glasses. Designed specifically for zero-shot PAD evaluation: train on a subset of attack types, test on unseen ones.

SiW-M is the closest public proxy for the full attack landscape an iBeta Level 2 / Level 3 evaluation will throw at your model. It is still small (only 1,630 videos), but no other public set covers this many attack types in a single corpus.

ROSE-Youtu

3,350 videos from 20 subjects, captured on five mobile devices. Includes printed-photo, video-replay and 3D paper-mask attacks. Smaller than OULU-NPU, but valuable for cross-device PAD research.

3D mask attack datasets

The 2D benchmarks above will not protect you from a silicone or latex mask attack. These datasets fill that gap, but they are tiny.

3DMAD

The first public 3D mask attack dataset, released by Idiap in 2013. 76,500 video frames from 17 subjects, captured with a Microsoft Kinect (RGB + depth). Masks are paper-craft 3D head models from ThatsMyFace.com: low fidelity by 2026 standards, but still valuable as a cross-modality benchmark.

HKBU-MARs V2

Hong Kong Baptist University, 2016. 1,008 videos from 12 subjects, with six types of 3D masks captured on seven different cameras. Designed specifically for remote photoplethysmography (rPPG) PAD research, which uses subtle skin-colour changes to detect a real face.

The combination of multiple mask types and multiple cameras makes HKBU-MARs more useful than its small subject count suggests, but 12 people is still 12 people. It is not a training set; it is an evaluation set.
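
The rPPG cue can be illustrated with a toy computation: average the green channel over a face crop, remove the DC component, and check how much spectral power falls in the human pulse band. This is a deliberately minimal sketch on synthetic frames, not a production rPPG pipeline (real systems need face tracking, detrending and illumination compensation); all names and numbers here are ours.

```python
import numpy as np

def pulse_band_power(frames, fps=30.0, band=(0.7, 4.0)):
    """Crude rPPG liveness cue: fraction of spectral power in the human
    pulse band (0.7-4 Hz, i.e. 42-240 bpm) of the mean green-channel signal.

    frames: array of shape (T, H, W, 3), RGB face crops over time.
    A live face shows a periodic blood-volume signal in this band;
    a printed photo or rigid mask should not.
    """
    green = frames[:, :, :, 1].mean(axis=(1, 2))   # mean green per frame
    green = green - green.mean()                   # remove DC component
    spectrum = np.abs(np.fft.rfft(green)) ** 2
    freqs = np.fft.rfftfreq(len(green), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = spectrum[1:].sum()                     # skip the DC bin
    return spectrum[in_band].sum() / total if total > 0 else 0.0

# Synthetic check: a 1.2 Hz "pulse" riding on the green channel should
# concentrate power in the pulse band; pure sensor noise should not.
t = np.arange(300) / 30.0                          # 10 s at 30 fps
rng = np.random.default_rng(1)
base = rng.normal(128, 1.0, size=(300, 8, 8, 3))   # noise-only "spoof"
live = base.copy()
live[:, :, :, 1] += 3.0 * np.sin(2 * np.pi * 1.2 * t)[:, None, None]
print(f"live:  {pulse_band_power(live):.2f}")
print(f"spoof: {pulse_band_power(base):.2f}")
```

A detector would threshold this ratio (or feed it to a classifier); the HKBU-MARs camera diversity exists precisely because this signal degrades differently on different sensors.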

Multi-modal and multi-channel PAD

Modern PAD systems increasingly use more than just RGB: depth from a structured-light or time-of-flight sensor, infrared, and sometimes thermal. Several datasets target this.

CASIA-SURF

The first large-scale multi-modal PAD benchmark. 21,000 videos from 1,000 subjects, captured simultaneously in RGB, depth and IR with an Intel RealSense camera. Attack types are limited (print and cut-photo), but the modality coverage and subject volume made CASIA-SURF the standard multi-modal PAD benchmark for several years.

CASIA-SURF CeFA

The cross-ethnicity extension of CASIA-SURF. 23,583 videos from 1,607 subjects across three ethnicities (African, East Asian, Central Asian), in RGB + depth + IR. It adds 3D attacks (silicone and paper masks) on top of the 2D attacks from CASIA-SURF. CeFA is the first PAD dataset with explicit ethnicity labels, which matters if you need to demonstrate that your model is not biased across demographic groups (increasingly a regulatory requirement).

Large-scale 2D spoof datasets

CelebA-Spoof

The largest public PAD dataset by image count. 625,537 images of 10,177 subjects, built on top of CelebA, which is why the subject count is so high. Each image carries 43 attributes: the 40 face attributes inherited from CelebA plus three PAD attributes (spoof type, illumination, environment). Ten spoof types are covered, captured by more than ten sensor models across eight environment-illumination scenes.

CelebA-Spoof is the closest the public ecosystem gets to a “train a deep PAD model from scratch” corpus. The trade-off is that it is image-based, not video, so it cannot be used for temporal liveness signals. RGB only, research-only license.

Deepfake detection datasets

Deepfakes are a different attack category from physical presentation attacks, but most production face verification systems now have to detect both. The two datasets below are the academic standard for deepfake detection.

FaceForensics++

A 2019 release with 1,000 original YouTube videos and corresponding manipulated versions generated by four methods: DeepFakes, Face2Face, FaceSwap and NeuralTextures. It provides four compression levels to test detector robustness.

FaceForensics++ is the deepfake-detection benchmark that almost every paper still reports on. It is small by modern standards, but the four-method coverage and the quality controls make it the canonical evaluation set.

Celeb-DF v2

590 original celebrity videos and 5,639 deepfake videos generated with an improved synthesis pipeline. Celeb-DF is widely used as a cross-dataset test set: train on FaceForensics++ or DFDC, test on Celeb-DF, and see how badly your detector generalises. The answer is usually “very badly”, which is the entire point of the dataset.

Where public datasets stop being enough

PAD is the area where the gap between public datasets and what production needs is the widest. Six recurring problems:

  1. They are too small for the attack types that matter most. SiW-M has 13 attack types, but only 1,630 videos total. HKBU-MARs has six 3D mask types, but 12 subjects. Public 3D-mask data simply does not exist at training scale.
  2. They use low-fidelity attack instruments. Most public 3D mask datasets use paper-craft or low-cost rigid masks. Real attackers use high-quality silicone, custom latex, and increasingly 3D-printed resin. None of these are well represented in academic data.
  3. They do not cover iBeta certification requirements. iBeta Level 1 requires print, paper-mask and replay attacks at scale. iBeta Level 2 requires high-quality 3D masks (silicone, latex, resin). iBeta Level 3 requires very-high-fidelity 3D masks. No public dataset gives you the volume and quality to pass any iBeta level reliably, and a successful iBeta certification is a hard prerequisite for almost every regulated KYC, fintech and identity verification deployment.
  4. They lack matched real-and-spoof pairs. Many datasets contain attack videos and bonafide videos from different people. Production PAD models need genuine and attack videos from the same identity to learn the right discriminative features.
  5. They under-represent demographics. Most use East Asian and European subjects. Skin tone matters for PAD because some texture and rPPG signals behave differently across tones, and a model that fails on darker skin will fail your fairness review.
  6. They are research-only, with no commercial license at all. Every public PAD dataset listed above prohibits commercial use. There is no public PAD dataset you can legally train a shipped product on.
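
Point 4 above is concrete enough to show in code: a training sampler that only yields real/spoof pairs from the same identity, silently dropping identities for which one side is missing. The index format and field names are hypothetical stand-ins for whatever metadata your dataset actually provides.

```python
import random
from collections import defaultdict

def paired_batches(samples, batch_pairs=4, seed=0):
    """Yield batches of (real, spoof) clips from the SAME identity.

    samples: list of dicts like {"id": ..., "label": "real"|"spoof",
    "path": ...} -- a stand-in for your dataset index. Identities that
    lack either a real or a spoof clip are skipped, which is exactly
    the matched data most public PAD sets fail to provide.
    """
    by_id = defaultdict(lambda: {"real": [], "spoof": []})
    for s in samples:
        by_id[s["id"]][s["label"]].append(s)
    usable = [i for i, g in by_id.items() if g["real"] and g["spoof"]]
    rng = random.Random(seed)
    rng.shuffle(usable)
    batch = []
    for ident in usable:
        g = by_id[ident]
        batch.append((rng.choice(g["real"]), rng.choice(g["spoof"])))
        if len(batch) == batch_pairs:
            yield batch
            batch = []
    if batch:
        yield batch

# Toy index: two identities with both labels, one with only an attack clip.
index = [
    {"id": "a", "label": "real",  "path": "a_real.mp4"},
    {"id": "a", "label": "spoof", "path": "a_print.mp4"},
    {"id": "b", "label": "real",  "path": "b_real.mp4"},
    {"id": "b", "label": "spoof", "path": "b_replay.mp4"},
    {"id": "c", "label": "spoof", "path": "c_print.mp4"},  # no real clip: skipped
]
for batch in paired_batches(index, batch_pairs=2):
    for real, spoof in batch:
        print(real["id"], real["path"], "<->", spoof["path"])
```

On a dataset without matched pairs, `usable` collapses toward empty and the sampler starves, which is the practical symptom of problem 4.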

For research and benchmarking, public datasets are fine and you should use them. For shipping a liveness system that has to pass iBeta and a customer security review, you will need a commercial source.

Commercial liveness datasets from Axon Labs

Axon Labs builds and licenses face anti-spoofing datasets specifically for commercial PAD systems. Every dataset is GDPR-compliant, collected with informed consent, and licensed for commercial training and evaluation. 21% of the companies certified by iBeta in 2025 are Axon Labs clients, so these datasets are directly aligned with what passes certification.

iBeta Level 1 Dataset

The dataset for teams preparing an iBeta Level 1 PAD certification. 35,000+ attack videos covering all Level 1 presentation attack instrument types (paper masks, photo prints, video replay), captured under the conditions and protocols that iBeta evaluates against. Aligned bonafide videos from the same identities are included → iBeta Level 1 Dataset

iBeta Level 2 Dataset

For Level 2 certification: 25,000+ videos of higher-fidelity 3D attacks including silicone masks, latex masks and cloth 3D masks, plus the Level 1 attack categories. This is the level most fintech and KYC vendors actually need to pass → iBeta Level 2 Dataset

iBeta Level 3 Dataset

For the highest tier of presentation attack certification: high-fidelity 3D mask attacks designed to defeat sophisticated PAD systems → iBeta Level 3 Dataset

Silicone Mask Attack Dataset

A focused dataset of 10,000+ videos generated from 18 different silicone masks, captured under varied lighting and device conditions. Use it when you need targeted training data for the silicone-mask failure mode rather than a full iBeta corpus → Silicone Mask Attack Dataset

Photo Print Attacks Dataset

Paper-based spoofing data from 3,000+ individuals, covering matte, glossy, cut-photo and warped print attacks across multiple printers and paper grades → Photo Print Dataset

Replay Display Attacks Dataset

Screen-replay spoofing across multiple display devices (smartphones, tablets, monitors) and recording cameras, the second most common 2D attack vector after print → Replay Display Dataset

Specialised mask and PAI datasets

Beyond the iBeta-aligned core, Axon Labs maintains specialised attack datasets for narrower research and evaluation needs: high-fidelity rubber masks, 3D resin masks, wrapped 3D attacks, advanced 3D paper masks, 3D latex masks, cutout prints, and IR+RGB iBeta data. See the full datasets catalogue for the complete list.

Custom liveness data collection

When your problem is narrower than any catalogue dataset (a specific attack instrument, a specific device, a specific country, a specific demographic), Axon Labs runs custom collections under the same consent and compliance framework → Contact us for a scoped quote

Which dataset should you actually use?

A short decision guide based on what we hear most often from PAD teams:

  • You are writing a paper or running a benchmark. Use OULU-NPU for mobile 2D PAD, SiW-M for multi-attack zero-shot, and CASIA-SURF CeFA for multi-modal cross-ethnicity. Cross-database evaluation against CASIA-FASD and Replay-Attack is still expected by most reviewers.
  • You are training a deep PAD model from scratch on RGB only. CelebA-Spoof is the only public option with the volume to support it.
  • You are building a multi-spectral PAD system. Use WMCA or HQ-WMCA; there is no public substitute for their four- and five-channel coverage.
  • You are preparing for iBeta Level 1 certification. Public data will not get you there → iBeta Level 1 Dataset
  • You are preparing for iBeta Level 2 certification → iBeta Level 2 Dataset
  • You need to harden a production PAD model against silicone masks specifically → Silicone Mask Attack Dataset
  • You are also building face recognition. See our companion guide on face recognition datasets.
  • You need to detect deepfakes as well as physical presentation attacks. Train on DFDC, evaluate cross-domain on Celeb-DF and FaceForensics++. Combine with a physical PAD model: deepfake detectors and physical-attack detectors do not generalise to each other.
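
Running a physical PAD model and a deepfake detector in parallel usually comes down to a conservative decision rule: accept only when both checks pass. The sketch below uses illustrative score conventions and thresholds of our own choosing; in practice each threshold is calibrated separately on its own development set.

```python
def accept_as_live(pad_live_score, deepfake_fake_score,
                   pad_threshold=0.5, deepfake_threshold=0.5):
    """Conservative AND-fusion: accept only if the physical PAD model
    says live AND the deepfake detector does not flag manipulation.

    pad_live_score: higher = more likely a genuine physical presentation.
    deepfake_fake_score: higher = more likely a synthetic/manipulated face.
    Scores, names and thresholds here are illustrative, because the two
    detectors are trained on disjoint attack families and do not share
    a calibration.
    """
    physically_live = pad_live_score >= pad_threshold
    not_deepfake = deepfake_fake_score < deepfake_threshold
    return physically_live and not_deepfake

print(accept_as_live(0.9, 0.1))  # live face, no manipulation -> True
print(accept_as_live(0.9, 0.8))  # looks live but flagged as deepfake -> False
print(accept_as_live(0.2, 0.1))  # physical spoof -> False
```

AND-fusion trades a slightly higher bona fide rejection rate for coverage of both attack families, which is usually the right trade in a KYC flow.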

Frequently asked questions

Which face anti-spoofing dataset is the best?

There is no single best dataset; the right choice depends on which attack types and modalities you care about. For mobile 2D PAD, OULU-NPU is the canonical benchmark. For attack-type diversity, SiW-M covers 13 attack categories in one corpus. For multi-modal PAD, WMCA (RGB + depth + IR + thermal) and CASIA-SURF CeFA (RGB + depth + IR with cross-ethnicity labels) are the strongest options. For training a deep model from scratch, CelebA-Spoof is the largest by image count.

Can I use public face anti-spoofing datasets in a commercial product?

No. Every public face anti-spoofing dataset listed above is licensed for academic research only. Training a commercial liveness system on them is a license violation, and most customer security reviews will catch it.

Is there a public dataset that covers iBeta Level 1?

No public dataset covers iBeta Level 1 protocols at the volume and quality required. Commercially licensed datasets such as the iBeta Level 1 Dataset from Axon Labs are built specifically against the iBeta PAI requirements.

Which datasets include silicone mask attacks?

SiW-M includes a small number of silicone-mask videos as one of its 13 attack categories. WMCA and HQ-WMCA include flexible silicone masks alongside other PAI types in a multi-channel setup. None of these are large enough for production training. The Silicone Mask Attack Dataset from Axon Labs provides 10,000+ videos generated from 18 distinct silicone masks under a commercial license.

Do I need separate data for deepfake detection?

Yes. Physical presentation attack detectors (trained on print, replay and mask data) do not transfer to deepfake detection, and vice versa. Production face verification systems usually run two models in parallel, a physical PAD model and a deepfake detector, and need separate training data for each.

How do I measure demographic fairness in a PAD model?

Use CASIA-SURF CeFA, which has explicit ethnicity labels across African, East Asian and Central Asian groups, and report per-group APCER (attack presentation classification error rate) and BPCER (bona fide presentation classification error rate). Treat any group-to-group gap of more than a few percentage points as a finding that needs targeted training data, not just a tweaked threshold.
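
Per-group APCER and BPCER are straightforward to compute once you have binary decisions and group labels. A minimal sketch following the ISO/IEC 30107-3 definitions; the predictions and group labels below are synthetic stand-ins for model output and CeFA-style ethnicity annotations.

```python
import numpy as np

def apcer(pred_live, is_attack):
    """Attack Presentation Classification Error Rate: fraction of attack
    presentations wrongly classified as bona fide (ISO/IEC 30107-3)."""
    return float(np.mean(pred_live[is_attack]))

def bpcer(pred_live, is_attack):
    """Bona Fide Presentation Classification Error Rate: fraction of
    genuine presentations wrongly classified as attacks."""
    return float(np.mean(~pred_live[~is_attack]))

# Synthetic per-sample decisions and demographic group labels.
pred_live = np.array([1, 1, 0, 0, 1, 0, 1, 0], dtype=bool)
is_attack = np.array([0, 0, 1, 1, 0, 1, 1, 1], dtype=bool)
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

for g in np.unique(groups):
    m = groups == g
    print(f"group {g}: APCER={apcer(pred_live[m], is_attack[m]):.2f} "
          f"BPCER={bpcer(pred_live[m], is_attack[m]):.2f}")
```

The per-group loop is the whole trick: a single pooled APCER/BPCER number can hide a group whose error rate is several times the average.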

Accelerate Your AI Development Today

Speed up your AI projects with our high-quality, ready-to-use datasets. Enjoy easy integration, fast deployment, and reliable biometric data collection.

© 2022 – 2026 Copyright protected