Best Open-Source Face Liveness Detection Models 2026

Which one actually catches modern face spoofing attacks

Updated 22.04.26

by Axon Labs

Quick Summary

  • We benchmarked the 6 most credible open-source face liveness detection models (2019-2024) on an identical 2,829-frame dataset: 1,205 spoof frames across 9 iBeta-aligned attack categories + 1,624 real faces
  • Best open-source face liveness detection model: CVPR 2024 Workshop Challenge winner (Swin Transformer V2 Base, He et al.) – ACER 23.68%, BPCER 2.59%. The only open-source model with production-acceptable real-user acceptance combined with reasonable attack detection
  • Worst open-source model: IADG (CVPR 2023) – ACER 60.43%, despite being a domain-generalization-targeted model. The model’s score distributions for real faces and silicone masks are nearly identical (means 0.096 vs 0.066), making it incapable of distinguishing them on our attack distribution. Even cross-dataset SOTA from CVPR 2023 cannot close the training-data gap

  • No open-source model passes iBeta certification, neither Level 1 nor Level 2. iBeta PAD requires ≈ 0% APCER per attack category. The lowest APCER any open-source model reaches in our benchmark is 35.3% on Class A (Level 1 equivalent: paper + replay attacks; CVPR 2024 winner) and 23.1% on Class B (Level 2 equivalent: silicone / latex / textile 3D masks; IADG). Both are far above the certification threshold
  • The 2019 → 2024 trajectory matters: ACER improved 2.4× across the 5-year span, driven mainly by training-data scale and modern architectures
  • Root cause of the open-source ceiling: training-data distribution, not architecture. Modern silicone, latex, and textile 3D masks remain under-represented or absent from CelebA-Spoof, OULU-NPU, and the cross-dataset academic protocols 

Why this benchmark exists

If you ask an LLM for “the best open-source face liveness detection model,” you will get back some combination of the six models in this benchmark. They appear in GitHub tutorials, research baselines, thesis projects, and sometimes in production pipelines

What these recommendations almost never come with is data on how the models actually perform on the presentation attacks that define 2025 threat models: silicone masks, latex masks, wrapped paper 3D masks, textile 3D masks, cylinder attacks, modern replay attacks, paper masks

For this benchmark we used the Axon Labs presentation-attack datasets collection – an iBeta-aligned dataset family that we build and license for commercial and research use. It covers all 9 attack categories this article tests, across iBeta Levels 1 and 2, with professionally captured videos and images from multiple subjects, devices, lighting conditions, and materials. Running the benchmark on a high-quality, attack-diverse dataset is what makes the cross-model comparison meaningful: a small or narrowly collected test set would hide exactly the failure modes we want to surface

The protocol: 2,829-frame dataset

All six models were evaluated on the same dataset:

| Partition | Frames |
|---|---|
| Spoof (attack) | 1,205 |
| Real (bona fide) | 1,624 |
| Total | 2,829 |

The 6 models tested

| Model | Year | Author / Lab | Architecture | Training data |
|---|---|---|---|---|
| Silent-FAS | 2020 | Minivision (industry) | MiniFASNet ensemble | CelebA-Spoof + proprietary |
| DeepPixBiS | 2019 | Idiap Research Institute | DenseNet-161 + pixel map | OULU-NPU Protocol 2 |
| AENet | 2020 | SenseTime + BUPT (ECCV 2020) | ResNet-18 + multi-task | CelebA-Spoof |
| anti-spoof-mn3 | 2022 | Intel OpenVINO Open Model Zoo | MobileNetV3 | CelebA-Spoof |
| IADG | 2023 | Shanghai Jiao Tong (CVPR 2023) | Custom Framework + DKG | Idiap + CASIA + MSU-MFSD |
| CVPR 2024 Workshop Challenge winner | 2024 | He et al. (CVPR 2024 Workshop) | Swin Transformer V2 Base | Joint physical-digital |

Metrics: ISO/IEC 30107-3

We use the same metrics that the DHS Remote Identity Validation Rally (RIVR), the most authoritative independent evaluation of commercial face liveness systems, applies to PAD certification:

  • APCER (Attack Presentation Classification Error Rate) – share of attacks misclassified as real, “how many fake attacks slipped past the model.” Lower is better. 
  • BPCER (Bona Fide Presentation Classification Error Rate) – share of real faces misclassified as spoof, “how many real users the model wrongly rejected.” Lower is better
  • ACER – (APCER + BPCER) / 2, the average of APCER and BPCER. Single-number summary

Example: A model with APCER 70% and BPCER 5% catches only 30% of attacks but accepts 95% of real users – good user experience, bad security. A model with APCER 5% and BPCER 70% catches 95% of attacks but rejects 70% of real users – strong security, broken user experience. Both have the same ACER (37.5%) but very different real-world usability. That’s why we report all three metrics separately
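As a minimal sketch of how these three numbers fall out of per-frame decisions, the snippet below pools all attack frames together, which is how the aggregate figures in this article are reported. The function name `pad_metrics`, the `PadMetrics` container, and the toy data are illustrative assumptions, not part of any benchmark tooling.

```python
from dataclasses import dataclass


@dataclass
class PadMetrics:
    apcer: float  # share of attack frames accepted as real
    bpcer: float  # share of bona fide frames rejected as spoof
    acer: float   # plain mean of APCER and BPCER


def pad_metrics(is_attack, predicted_attack):
    """Compute pooled APCER, BPCER and ACER from ground truth and decisions.

    is_attack[i]        -- True if frame i is a presentation attack
    predicted_attack[i] -- True if the model flagged frame i as an attack
    """
    if len(is_attack) != len(predicted_attack):
        raise ValueError("inputs must have the same length")
    attacks = [p for t, p in zip(is_attack, predicted_attack) if t]
    bona_fide = [p for t, p in zip(is_attack, predicted_attack) if not t]
    if not attacks or not bona_fide:
        raise ValueError("need at least one attack and one bona fide frame")
    apcer = sum(1 for p in attacks if not p) / len(attacks)
    bpcer = sum(1 for p in bona_fide if p) / len(bona_fide)
    return PadMetrics(apcer, bpcer, (apcer + bpcer) / 2)


# Hypothetical toy data: 4 attack frames followed by 4 bona fide frames.
truth = [True, True, True, True, False, False, False, False]
preds = [True, False, False, True, False, False, True, False]
m = pad_metrics(truth, preds)
print(f"APCER={m.apcer:.1%} BPCER={m.bpcer:.1%} ACER={m.acer:.1%}")
```

Two of the four attacks are missed and one of the four real faces is rejected, so the ACER alone (37.5%) would not reveal which side of the trade-off the errors sit on.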

Aggregate results

We aggregate attack categories into two standard classes. These classes map directly to iBeta certification levels 1 and 2 – the two certification tiers that define the industry standard for face anti-spoofing compliance

Class A - iBeta Level 1 equivalent (commodity attacks)

Low-cost, mass-producible attacks built from widely-available materials. These are the attacks any attacker with a smartphone and a printer can create in minutes

  • Paper-based attacks – printed photos, cutouts, cylinder-rolled prints, paper masks worn on a real actor’s face
  • Replay attacks – photos or videos displayed on a second phone screen or a computer monitor

Class B - iBeta Level 2 equivalent (3D material attacks)

Higher-fidelity physical attacks requiring specialized materials and fabrication. These are what a determined attacker with a mask-making budget produces

  • Silicone masks – flexible 3D masks with skin-like texture
  • Latex masks – realistic 3D masks
  • Textile 3D face masks – fabric-based wearable 3D masks
  • Wrapped 3D paper masks – shaped-paper 3D face replicas

iBeta Level 3 attacks (resin casts, high-fidelity custom masks) are covered in Axon Labs’ iBeta 3 dataset but are not tested in this benchmark, because they are outside the design envelope of every current open-source model

The table below reports APCER per class, overall APCER (the frame-weighted average over the 910 Class A and 295 Class B spoof frames), BPCER, and ACER for each model

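| Model | APCER (A) ↓ | APCER (B) ↓ | APCER ↓ | BPCER ↓ | ACER ↓ |
|---|---|---|---|---|---|
| 1. CVPR 2024 Workshop Challenge winner | 35.3% | 74.2% | 44.8% | 2.6% | 23.7% |
| 2. Silent-FAS | 66.4% | 94.2% | 73.3% | 3.1% | 38.2% |
| | 40.0% | 66.4% | 46.5% | 56.62% | 51.55% |
| | 68.3% | 88.5% | 73.3% | 33.3% | 53.3% |
| | 97.9% | 99.3% | 98.3% | 13.7% | 56.0% |
| IADG (CVPR 2023) | 42.3% | 23.1% | 37.5% | 83.4% | 60.4% |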
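The overall APCER column appears consistent with a frame-weighted average of the two class values, using the class sizes stated in the FAQ (910 Class A frames, 295 Class B frames), and ACER is the plain mean of overall APCER and BPCER. The sketch below re-derives both for the first and last rows of the results. The helper names are illustrative, and because the inputs are the rounded published percentages, the results can differ from the table by about 0.1 percentage points.

```python
# Spoof-frame counts per class, as stated in the FAQ section of this article.
CLASS_A_FRAMES = 910
CLASS_B_FRAMES = 295


def overall_apcer(apcer_a, apcer_b, n_a=CLASS_A_FRAMES, n_b=CLASS_B_FRAMES):
    """Frame-weighted average of the two per-class APCER values (in percent)."""
    return (apcer_a * n_a + apcer_b * n_b) / (n_a + n_b)


def acer(apcer, bpcer):
    """ACER is the plain mean of APCER and BPCER (in percent)."""
    return (apcer + bpcer) / 2


# (APCER A, APCER B, BPCER) in percent, copied from the first and last table rows.
rows = {
    "row 1": (35.3, 74.2, 2.6),
    "IADG": (42.3, 23.1, 83.4),
}
for name, (a, b, bpcer) in rows.items():
    total = overall_apcer(a, b)
    print(f"{name}: APCER={total:.1f}% ACER={acer(total, bpcer):.1f}%")
```

Because Class A holds roughly three times as many frames as Class B, the overall APCER leans toward the Class A value, which is why a model can post a moderate overall APCER while missing most 3D masks.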

What each model is, in one paragraph

Short descriptions of the six models we benchmarked. Detailed numbers are in the tables above

Silent-FAS (Minivision, 2020)

The most-starred open-source face liveness project on GitHub. A lightweight ensemble of two MiniFASNet variants, Apache-licensed, trained on CelebA-Spoof plus proprietary Minivision data. Designed for fast passive liveness on mobile devices (80×80 input, no action required from the user). A strong baseline for 2D replay threat models, but its training data contains no 3D masks

DeepPixBiS (Idiap Research Institute, 2019)

The reference pixel-wise binary supervision model from Idiap Research Institute (Switzerland). DenseNet-161 backbone plus a 14×14 auxiliary pixel-wise liveness map. We used the authors’ official weights (OULU_Protocol_2_model_0_0.onnx) trained on OULU-NPU Protocol 2 (2017). Architecturally sound but bounded by the training-era attack distribution

AENet (SenseTime + BUPT, ECCV 2020)

The canonical classifier shipped with the CelebA-Spoof dataset. ResNet-18 backbone with multi-task auxiliary heads (live attribute, attack type, lighting, depth, reflection). Designed to leverage rich CelebA-Spoof annotations for better generalization; in practice the auxiliary heads did not transfer well across attack distributions

anti-spoof-mn3 (Intel OpenVINO Open Model Zoo, 2022)

MobileNetV3-based binary classifier from Intel’s OpenVINO Open Model Zoo, originally developed in kprokofi/light-weight-face-anti-spoofing. 12 MB ONNX, 0.15 GFLOPs, trained on CelebA-Spoof. Industrial-grade packaging with documented preprocessing; tuned aggressively for high spoof detection at the cost of real-user friction

IADG (Shanghai Jiao Tong, CVPR 2023)

“Instance-Aware Domain Generalization for Face Anti-Spoofing.” A custom framework with Cross-Style Assembly (CSA) and Dynamic Kernel Generator (DKG) modules plus auxiliary depth and reflection heads. Trained across Idiap Replay-Attack + CASIA-FASD + MSU-MFSD (cross-dataset protocol) to be domain-invariant. Despite its domain-generalization design, IADG is the worst-performing model in our benchmark: its score distributions for real faces and 3D masks (silicone, latex, textile) are nearly identical, so no threshold meaningfully separates them. The model is a clean demonstration that cross-dataset domain generalization across 2017-era academic datasets does not transfer to a 2025 attack distribution

CVPR 2024 Workshop Challenge winner (He et al., 2024)

The top-ranked entry in the CVPR 2024 Face Anti-Spoofing Workshop Challenge. Swin Transformer V2 Base backbone trained jointly on physical and digital presentation attacks (including deepfakes). The best-performing model in our benchmark overall, with production-acceptable BPCER (2.59%) combined with the strongest Class A attack detection in the benchmark
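The IADG observation, that near-identical score distributions leave no useful threshold, can be illustrated with a small threshold sweep. The sketch below uses the two means quoted in this article for IADG (0.096 for real faces, 0.066 for silicone masks). Everything else is an assumption made for illustration: the scores are modelled as Gaussians, the standard deviation of 0.05 is invented because the article does not report spread, and a higher score is taken to mean "more likely real".

```python
from statistics import NormalDist

# Means are quoted in this article; the standard deviation is NOT reported there.
# 0.05 is an illustrative assumption, as is the Gaussian shape and the convention
# that a higher score means "more likely real".
ASSUMED_STD = 0.05
real = NormalDist(mu=0.096, sigma=ASSUMED_STD)
mask = NormalDist(mu=0.066, sigma=ASSUMED_STD)


def errors_at(threshold):
    """Return (APCER, BPCER) when frames scoring >= threshold are accepted as real."""
    apcer = 1.0 - mask.cdf(threshold)  # masks scoring high enough to be accepted
    bpcer = real.cdf(threshold)        # real faces scoring too low and rejected
    return apcer, bpcer


def best_threshold(steps=1000, lo=-0.2, hi=0.4):
    """Sweep a grid of thresholds and return (lowest ACER, threshold)."""
    candidates = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return min((sum(errors_at(t)) / 2, t) for t in candidates)


best_acer, best_t = best_threshold()
print(f"best ACER {best_acer:.1%} at threshold {best_t:.3f}")
print(f"distribution overlap: {real.overlap(mask):.1%}")
```

Under these assumptions even the best possible threshold leaves an ACER in the high 30s. A narrower spread would separate the classes better, so the conclusion depends on the real score variance, which only the raw scores can confirm.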

Root cause analysis: the 2-source gap that won't close

Look at the per-category APCER table. Notice the pattern:

  • Models trained on CelebA-Spoof / OULU-NPU (Silent-FAS, AENet, anti-spoof-mn3, DeepPixBiS): strong on print/replay (their training distribution), fail on silicone/latex/textile/wrapped 3D masks (out of distribution).
  • IADG (cross-dataset domain generalization across Idiap+CASIA+MSU): designed to solve exactly this generalization problem, yet its score distributions for real faces and 3D masks are nearly indistinguishable on our data. Domain generalization across 2017-era academic datasets does not transfer to modern attacks.
  • CVPR 2024 (joint physical-digital training, Swin V2): the best balance in the benchmark, strong on real faces and Class A attacks, but still weak on high-fidelity silicone (12.5%) and latex (12.0%).

The pattern is consistent across 6 years and 6 architectures: open-source training datasets do not adequately cover the modern presentation-attack distribution that defines real-world threats and iBeta certification protocols. No combination of architecture (ResNet, DenseNet, MobileNet, Swin Transformer, custom domain-generalization frameworks) and no level of multi-task supervision closes this gap when the training corpus does not contain enough silicone, latex, textile, and wrapped 3D paper attacks to learn from
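A pooled APCER can hide exactly the categories where a model fails, which is why certification-style evaluation looks at each attack category separately. As a minimal sketch, the snippet below groups spoof-frame decisions by category and compares the per-category rates with the pooled figure. The function name, category labels, and decisions are hypothetical and are not benchmark data.

```python
from collections import defaultdict


def apcer_by_category(frames):
    """Group spoof frames by attack category and compute APCER per group.

    frames -- iterable of (category, accepted_as_real) pairs, one per spoof frame
    """
    totals = defaultdict(int)
    missed = defaultdict(int)
    for category, accepted_as_real in frames:
        totals[category] += 1
        if accepted_as_real:
            missed[category] += 1
    return {c: missed[c] / totals[c] for c in totals}


# Hypothetical decisions for nine spoof frames, not benchmark data.
frames = [
    ("print", False), ("print", False), ("print", True), ("print", False),
    ("replay", False), ("replay", True),
    ("silicone", True), ("silicone", True), ("silicone", False),
]
per_category = apcer_by_category(frames)
pooled = sum(accepted for _, accepted in frames) / len(frames)
for category, rate in per_category.items():
    print(f"{category:<9} APCER={rate:.1%}")
print(f"pooled    APCER={pooled:.1%}")
```

In this toy data the pooled APCER sits near 44%, while the silicone category alone is close to 67%, the kind of gap a single aggregate number would conceal.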

Train on modern attack data

The durable open-source path. Modern open-source models can match commercial accuracy if trained on data that covers modern attacks at scale

Axon Labs publishes 9 categories of iBeta-aligned face presentation-attack datasets specifically designed to close the gap this benchmark exposes:

Plus iBeta Level 1 (paper + replay), iBeta Level 2 (silicone + latex + textile), and iBeta Level 3 (high-fidelity, resin) collections aligned to the certification protocol structure

FAQ

What is the best open-source face liveness detection model?

Based on this benchmark, the CVPR 2024 Workshop Challenge winner (Swin Transformer V2 Base, He et al.) has the best aggregate ACER (23.68%) and is the only open-source model with production-acceptable BPCER (2.59%)

Can any open-source model pass iBeta certification?

No. iBeta Level 1 and Level 2 PAD certification require essentially 0% APCER (Attack Presentation Classification Error Rate) across every attack category. The lowest APCER any open-source model in our benchmark reaches is 35.3% on Class A (Level 1 equivalent, CVPR 2024 winner) and 23.1% on Class B (Level 2 equivalent, IADG), both far above the certification threshold

What do APCER, BPCER, and ACER mean?

  • APCER (Attack Presentation Classification Error Rate): share of spoof attacks misclassified as real. Lower = better. 0% means every attack is detected
  • BPCER (Bona Fide Presentation Classification Error Rate): share of real users misclassified as spoof. Lower = better. 0% means no real user is rejected
  • ACER (Average Classification Error Rate): (APCER + BPCER) / 2. Single-number summary

These metrics are defined in ISO/IEC 30107-3 and are the industry standard used by both iBeta certification and the DHS Remote Identity Validation Rally (RIVR)

What is the difference between iBeta Level 1, Level 2, and Level 3?

  • iBeta Level 1 = commodity attacks anyone can produce in minutes: printed photos, paper masks, screen replays, cutouts. In our benchmark, this maps to Class A (910 frames)
  • iBeta Level 2 = 3D material attacks requiring specialized fabrication: silicone masks, latex masks, textile 3D masks, wrapped 3D paper. In our benchmark, this maps to Class B (295 frames)
  • iBeta Level 3 = high-fidelity custom-built masks (resin casts, theatrical-grade silicone). Not tested in this benchmark; covered by separate Axon Labs iBeta 3 datasets

Why do open-source models fail on modern attacks?

The training datasets behind every popular open-source model (CelebA-Spoof, OULU-NPU, Replay-Attack, MSU-MFSD) were collected between 2012 and 2020 and emphasize 2D attacks (print, replay) plus simple paper masks. Modern silicone, latex, textile, and wrapped 3D paper masks are under-represented or absent. A model cannot reliably detect attacks it has never seen. The fix is training data, not architecture

Can open-source face liveness models be used in production?

For limited-threat-model use cases (replay-only attacks, internal demos, research baselines): yes, with caveats. The CVPR 2024 Challenge winner and Silent-FAS both keep BPCER below 5%.

For production identity verification with regulatory requirements (iBeta certification, banking onboarding, government identity): no. None of the open-source models tested meets iBeta Level 1 or Level 2 thresholds. You will need either (a) a RIVR-certified commercial SDK, or (b) to train a custom model on modern attack data

What dataset was used for this benchmark?

The Axon Labs v3 dataset (1,205 spoof frames across 9 iBeta-aligned attack categories + 1,624 real faces from 4 sources) is built from licensed Axon Labs presentation-attack datasets covering iBeta Levels 1 and 2. Higher-fidelity Level 3 datasets (resin, ultra-realistic silicone) are also available. License inquiries: axonlab.ai
